How trim: Bug in RegExp engine
Marielle Lange
mlange at lexicall.org
Mon Oct 24 18:40:25 EDT 2005
> The other main issue is that Rev does not support all the fine
> nuances of Perl-style RegEx, though the docs say it does.
A problem is that their documentation doesn't match what their
functions do. A table that summarizes the regular expression codes
found in nearly all programs that implement regular expressions can
be seen at:
http://revolution.lexicall.org/wiki/tiki-index.php?page=RegularExpressions
What is missing in rev doc:
{}    The braces force the preceding character (or group) to match a
      specific number of times.
      Ex: (rat){3} matches ratratrat
          rat{3} matches rattt
          rat{2,5} matches ratt or rattt or ratttt or rattttt
          (between 2 and 5 t's)
This is implemented, though:
put "_" & replaceText("AAAAAAA","A{3}","") -> A
put "_" & replaceText("AAAAAAA","A{4}","") -> AAA
put "_" & replaceText("AAAAAAA","A{5}","") -> AA
put "_" & replaceText("AAAAAAA","A{6}","") -> A
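For comparison, here is a minimal sketch of the same quantifier cases in another Perl-style engine. It uses Python's re module rather than Rev's replaceText, so it only shows what the syntax is expected to do, not what Rev actually does; the variable names are made up for the illustration.

```python
import re

# {n} applies to the preceding character, or to a group if one is used.
group_repeat = re.fullmatch(r"(rat){3}", "ratratrat") is not None
char_repeat = re.fullmatch(r"rat{3}", "rattt") is not None

# {2,5} accepts between 2 and 5 repetitions of "t".
candidates = ["rat", "ratt", "rattttt", "ratttttt"]
range_repeat = [w for w in candidates if re.fullmatch(r"rat{2,5}", w)]

# Removing every run of n "A"s from seven "A"s leaves the remainder.
leftovers = {n: re.sub("A{%d}" % n, "", "AAAAAAA") for n in (3, 4, 5, 6)}

print(group_repeat, char_repeat, range_repeat)
print(leftovers)
```

The leftovers agree with the replaceText results listed above, which suggests the braces themselves are handled the usual way.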
There is an error in their documentation:
[ABC]|[XYZ] matches “AY” or “CX”, but not “AA” or “ZB”.
should be:
[ABC][XYZ] matches “AY” or “CX”, but not “AA” or “ZB”. (i.e., this
example is not appropriate for illustrating "|")
Fortunately, the function itself behaves normally:
put "AYCXAA" into tText; put replaceText(tText, "[ABC][XYZ]", "") -> AA
put "AYCXAA" into tText; put replaceText(tText, "[ABC]|[XYZ]", "") -> empty
The correct example is
(AY|CX) matches “AY” or “CX”
or a more telling one
(mouse|mice) matches mouse or mice.
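The difference between two character classes in sequence and two classes joined by "|" can be sketched as follows. This is Python's re module, used only as a reference Perl-style engine and not as a statement about Rev; the sample sentence and variable names are assumptions added for the illustration.

```python
import re

text = "AYCXAA"

# Two classes in sequence: one of A/B/C followed by one of X/Y/Z.
sequence_removed = re.sub(r"[ABC][XYZ]", "", text)

# Two classes joined by "|": any single character from either class.
either_removed = re.sub(r"[ABC]|[XYZ]", "", text)

# Alternation between whole words, as in (mouse|mice).
plural_hits = re.findall(r"mouse|mice", "one mouse, two mice, three moose")

print(repr(sequence_removed), repr(either_removed), plural_hits)
```

With "|" every character of "AYCXAA" belongs to one class or the other, so nothing is left, which is why the documented example does not describe what "|" does.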
> I don't remember the details, but I ran into problems trying to use
> look-around features, for instance. I've come to the conclusion
> that I should try a simple version of what I want first in the
> Message Box, then put it into my script.
I was surprised to see Mark use \s and \S as they are not mentioned
in the documentation (which hasn't been updated to follow updates in
the function in version 2.5). Full information about these special
codes can be found below.
Interestingly, start and end of text can also be represented by \A
and \Z. They work in Revolution but produce yet another behaviour.
Honestly, I was pleased to read that regular expressions had been
improved (version 2.6?)... but there are obviously some more problems
to fix.
put "_" & replaceText(" A C","^ *","") -> _A C
put "_" & replaceText("A C","^ *","") -> _C
put "_" & replaceText(" A C","\A ","") -> _A C (space before A C)
put "_" & replaceText("A C","\A ","") -> _A C (no space)
put "_" & replaceText("A C","\A *","") -> _
put "_" & replaceText(" A C","\A *","") -> _
I tried the word-boundary codes (\b for the edge of a word, \B for
not the edge) and these seem to behave strangely as well:
put "_" & replaceText(" A C","\B *","") -> _A C
put "_" & replaceText(" A C","\b *","") -> _
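As a point of comparison, the sketch below runs the same kind of patterns through Python's re module, using the test strings as they appear above. It is an assumption that Python is a fair stand-in for what a Perl-style engine "should" return; the helper name strip_with is invented for the illustration, and engines can differ on how empty matches are handled.

```python
import re

def strip_with(pattern, text):
    """Return text with every match of pattern removed, prefixed by "_"."""
    return "_" + re.sub(pattern, "", text)

# ^ and \A both anchor at the start, so only leading spaces are removed.
leading = strip_with(r"^ *", " A C")
no_leading = strip_with(r"^ *", "A C")
anchored = strip_with(r"\A *", " A C")

# \B only matches away from word edges: here, before the leading space.
non_boundary = strip_with(r"\B *", " A C")

# \b matches at each word edge; only the space after "A" follows one.
boundary = strip_with(r"\b *", " A C")

for result in (leading, no_leading, anchored, non_boundary, boundary):
    print(repr(result))
```

In this engine the anchored patterns never touch anything but leading spaces, so a result such as "_C" or "_" for these inputs would not be expected from the pattern syntax alone.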
------------------------------------------------------------------------
\b and \B   NaV. \b matches the empty string at the edge of a word;
            \B matches the empty string if not at the edge of a word.
            Ex: \bcomput will match "computer" or "computing", but not
            "supercomputer" since there are no spaces or punctuation
            between "super" and "computer". \Bcomput will not match
            "computer" or "computing", unless it is part of a bigger
            word such as "supercomputer" or "recomputing".
\w and \W   NaV. \w matches word-constituent characters (letters, "_",
            & digits); \W matches characters that are not
            word-constituent.
            Ex: a\wz matches "abz", "aTz", "a5z", "a_z", or any
            three-character string starting with "a", ending with "z",
            and whose second character was either a letter (upper- or
            lower-case), a number, or the underscore.
            a\Wz would not match "abz", "aTz", "a5z", or "a_z". It
            would match "a%z", "a z", "a?z" or any three-character
            string starting with "a" and ending with "z" and whose
            second character was not a letter, number, or underscore.
            (This means the second character must either be a symbol
            or a whitespace character.)
\d and \D   NaV. \d matches any digit. \D matches any character except
            a digit.
            Ex: a\Dz matches "abz", "aTz" or "a%z", not "a2z", "a5z"
            or "a9z".
            \D+ matches any non-null string which contains no numeric
            characters.
\s and \S   NaV. \s matches exactly one character of whitespace.
            (Whitespace is defined as spaces, tabs, newlines, or any
            character which would not use ink if printed on a
            printer.) \S matches any character that is not whitespace.
            Ex: a\sz would match any three-character string starting
            with "a" and ending with "z" and whose second character
            was a space, tab, or newline.
            a\Sz would match any three-character string starting with
            "a" and ending with "z" whose second character was not a
            space, tab or newline. (Thus, the second character could
            be a letter, number or symbol.)
\nnn        NaV. This is used for specifying control characters that
            have no typed equivalent. For example, \007 would find all
            subjects with an embedded ASCII "bell" character. (The
            bell is specified by an ASCII value of 7.) You will rarely
            need to use the octal metacharacter.
\A and \Z   Beginning and end of string (equivalents of ^ and $).
------------------------------------------------------------------------
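The codes in the table above can be tried out with a short sketch. It uses Python's re module as a stand-in Perl-style engine, which is an assumption and says nothing about how Rev itself behaves; the helper names found_in and whole_match, and the "ding/dong" test string, are invented for the illustration.

```python
import re

def found_in(pattern, words):
    """Return the words in which pattern is found anywhere."""
    return [w for w in words if re.search(pattern, w)]

def whole_match(pattern, words):
    """Return the words that pattern matches from start to end."""
    return [w for w in words if re.fullmatch(pattern, w)]

# \b: at the edge of a word; \B: not at the edge of a word.
edge = found_in(r"\bcomput", ["computer", "computing", "supercomputer"])
inner = found_in(r"\Bcomput", ["computer", "supercomputer", "recomputing"])

samples = ["abz", "a5z", "a_z", "a%z", "a z"]
word_char = whole_match(r"a\wz", samples)   # letter, digit or "_"
non_word = whole_match(r"a\Wz", samples)    # anything else
non_digit = whole_match(r"a\Dz", ["abz", "a%z", "a2z"])
space = whole_match(r"a\sz", ["a z", "a\tz", "a\nz", "abz"])
non_space = whole_match(r"a\Sz", ["a z", "a\tz", "abz", "a%z"])

# \007 is the octal escape for the ASCII bell character (value 7).
bell = re.search(r"\007", "ding\x07dong") is not None

# In multi-line mode ^ matches at each line start, \A only at the very start.
line_starts = re.findall(r"^\w+", "one\ntwo", re.MULTILINE)
text_start = re.findall(r"\A\w+", "one\ntwo", re.MULTILINE)

print(edge, inner)
print(word_char, non_word, non_digit, space, non_space)
print(bell, line_starts, text_start)
```

The last two lines show the one case where \A is not simply an equivalent of ^: when the engine treats each line of the text separately.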
Marielle Lange (PhD), Psycholinguist
Alternative emails: mlange at blueyonder.co.uk, M.Lange at ed.ac.uk
Homepage                              http://homepages.lexicall.org/mlange/
Easy access to lexical databases      http://lexicall.org
Supporting Education Technologists    http://revolution.lexicall.org