How trim: Bug in RegExp engine

Marielle Lange mlange at lexicall.org
Mon Oct 24 18:40:25 EDT 2005


> The other main issue is that Rev does not support all the fine  
> nuances of Perl-style RegEx, though the docs say it does.

A problem is that their documentation doesn't match what their
functions do. A table that summarizes the regular expression codes found
in nearly all programs that implement regular expressions can be seen
at: http://revolution.lexicall.org/wiki/tiki-index.php?page=RegularExpressions

What is missing in the Rev docs:
{} The braces force the preceding character (or group) to match a
           specific number of times.
           Ex:  (rat){3}    matches ratratrat
                rat{3}      matches rattt
                rat{2,5}    matches ratt, rattt, ratttt or rattttt
                            (between 2 and 5 t's)

Though this is implemented:
put "_" & replaceText("AAAAAAA","A{3}","")   -> _A
put "_" & replaceText("AAAAAAA","A{4}","")   -> _AAA
put "_" & replaceText("AAAAAAA","A{5}","")   -> _AA
put "_" & replaceText("AAAAAAA","A{6}","")   -> _A
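
As a cross-check outside Revolution, the same counted-repetition patterns
can be run through another Perl-style engine. The sketch below uses
Python's re module purely for convenience (it is not what Rev uses
internally), and the helper name strip_runs is made up for illustration.
It leaves the same remainders as the replaceText calls above.

```python
import re

def strip_runs(text, n):
    """Remove every non-overlapping run of exactly n "A" characters."""
    return re.sub("A{%d}" % n, "", text)

# Seven A's: whatever is left over after removing runs of n.
for n in (3, 4, 5, 6):
    print(n, "_" + strip_runs("AAAAAAA", n))
```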

There is an error in their documentation:
[ABC]|[XYZ] matches “AY” or “CX”, but not “AA” or “ZB”.
should be:
[ABC][XYZ] matches “AY” or “CX”, but not “AA” or “ZB”.  (i.e., the
documented example is inappropriate for illustrating "|")
Fortunately, the function itself behaves as expected:
put "AYCXAA" into tText; put replaceText(tText, "[ABC][XYZ]", "")   -> AA
put "AYCXAA" into tText; put replaceText(tText, "[ABC]|[XYZ]", "")  -> empty

A correct example for "|" would be
(AY|CX)   matches “AY” or “CX”
or a more telling one
(mouse|mice) matches “mouse” or “mice”.
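
To make the difference between "class followed by class" and "class or
class" concrete, here is a small sketch in Python's re module (an outside
engine chosen only as a reference point; the variable names are
illustrative). It agrees with the replaceText results above.

```python
import re

subject = "AYCXAA"

# Two classes in sequence: consumes two-character pairs such as AY and CX.
pairs = re.sub("[ABC][XYZ]", "", subject)

# Alternation of two classes: consumes any single character from either set.
either = re.sub("[ABC]|[XYZ]", "", subject)

# Alternation between whole words.
words = re.findall("mouse|mice", "one mouse, two mice, three moose")

print(repr(pairs), repr(either), words)
```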

> I don't remember the details, but I ran into problems trying to use  
> look-around features, for instance.  I've come to the conclusion  
> that I should try a simple version of what I want first in the  
> Message Box, then put it into my script.

I was surprised to see Mark use \s and \S as they are not mentioned
in the documentation (which hasn't been updated to reflect the changes
made to the function in version 2.5). Full information about these
special codes can be found below.

Interestingly, the start and end of the text can also be represented
by \A and \Z. They work in Revolution but produce yet another behaviour.
Honestly, I was pleased to read that regular expressions had been
improved (version 2.6?)... but there are obviously some more problems
to fix.

put "_" & replaceText(" A C","^ *","")  -> _A C
put "_" & replaceText("A C","^ *","")   -> _C

put "_" & replaceText(" A C","\A ","")   -> _A C   (space before A C)
put "_" & replaceText("A C","\A ","")     -> _A C   (no space)
put "_" & replaceText("A C","\A *","")    -> _
put "_" & replaceText(" A C","\A *","")    -> _
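
For comparison, this is what a different Perl-style engine does with the
same patterns. The sketch uses Python's re module (my choice of reference,
not something Rev is built on) and a made-up helper, trim_start. In Python
all six calls leave "A C" intact, which suggests, without proving it, that
the "_C" and "_" results above are specific to replaceText.

```python
import re

def trim_start(pattern, text):
    """Mimic: put "_" & replaceText(text, pattern, "")"""
    return "_" + re.sub(pattern, "", text)

cases = [
    (r"^ *", " A C"),
    (r"^ *", "A C"),
    (r"\A ", " A C"),
    (r"\A ", "A C"),
    (r"\A *", "A C"),
    (r"\A *", " A C"),
]

results = [trim_start(pattern, text) for pattern, text in cases]
for (pattern, text), result in zip(cases, results):
    print(repr(pattern), repr(text), "->", result)
```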


I tried the word-boundary codes (\b for the edge of a word, \B for
anywhere else) and these seem to behave strangely as well:

put "_" & replaceText(" A C","\B *","")   -> _A C
put "_" & replaceText(" A C","\b *","")   -> _
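
Again as an outside reference only, here are the same two patterns in
Python's re module (the helper name strip_spaces is hypothetical). There,
\B * removes just the leading space, as in Rev, while \b * removes only
the space that follows "A" and keeps the leading one, so it does not
empty the string.

```python
import re

def strip_spaces(pattern, text):
    """Mimic: put "_" & replaceText(text, pattern, "")"""
    return "_" + re.sub(pattern, "", text)

# \B holds at position 0 (start of text followed by a space: no word edge).
not_boundary = strip_spaces(r"\B *", " A C")

# \b holds on either side of "A" and "C", never before the leading space.
boundary = strip_spaces(r"\b *", " A C")

print(repr(not_boundary), repr(boundary))
```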

------------------------------------------------------------------------

  \b and \B    NaV. \b matches the empty string at the edge of a word;
               \B matches the empty string if not at the edge of a word.
               Ex: \bcomput will match "computer" or "computing", but not
               "supercomputer" since there are no spaces or punctuation
               between "super" and "computer". \Bcomput will not match
               "computer" or "computing", unless it is part of a bigger
               word such as "supercomputer" or "recomputing".

  \w and \W    NaV. \w matches word-constituent characters (letters,
               "_", and digits); \W matches characters that are not
               word-constituent.
               Ex: a\wz matches "abz", "aTz", "a5z", "a_z", or any
               three-character string starting with "a", ending with
               "z", and whose second character is either a letter
               (upper- or lower-case), a number, or the underscore.
               a\Wz would not match "abz", "aTz", "a5z", or "a_z". It
               would match "a%z", "a z", "a?z" or any three-character
               string starting with "a" and ending with "z" and whose
               second character is not a letter, number, or
               underscore. (This means the second character must
               either be a symbol or a whitespace character.)

  \d and \D    NaV. \d matches any digit. \D matches any character
               except a digit.
               Ex: a\Dz matches "abz", "aTz" or "a%z", not "a2z", "a5z"
               or "a9z".
               \D+ matches any non-null string which contains no
               numeric characters.

  \s and \S    NaV. \s matches exactly one character of whitespace.
               (Whitespace is defined as spaces, tabs, newlines, or any
               character which would not use ink if printed on a
               printer.) \S matches any character that is not whitespace.
               Ex: a\sz would match any three-character string starting
               with "a" and ending with "z" and whose second character
               is a space, tab, or newline.
               a\Sz would match any three-character string starting
               with "a" and ending with "z" whose second character is
               not a space, tab or newline. (Thus, the second character
               could be a letter, number or symbol.)

  \nnn         NaV. This is used for specifying control characters that
               have no typed equivalent. For example, \007 would find
               all subjects with an embedded ASCII "bell" character.
               (The bell is specified by an ASCII value of 7.) You will
               rarely need to use the octal metacharacter.

  \A and \Z    Beginning and end of string (equivalents of ^ and $).
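
The examples in the table above can be tried out in any Perl-style
engine. Here is a minimal sketch using Python's re module (an assumption
of convenience, not a statement about Rev's engine; the helper name hits
is made up for illustration).

```python
import re

def hits(pattern, candidates):
    """Return the candidates in which the pattern matches somewhere."""
    return [c for c in candidates if re.search(pattern, c)]

words = ["computer", "computing", "supercomputer", "recomputing"]
triples = ["abz", "aTz", "a5z", "a_z", "a%z", "a z", "a\tz"]

print(hits(r"\bcomput", words))   # "comput" at the start of a word
print(hits(r"\Bcomput", words))   # "comput" inside a longer word
print(hits(r"a\wz", triples))     # letter, digit or underscore in the middle
print(hits(r"a\Wz", triples))     # anything else in the middle
print(hits(r"a\dz", triples))     # digit in the middle
print(hits(r"a\sz", triples))     # whitespace in the middle
print(hits(r"\007", ["ding\x07", "silent"]))  # octal code for the bell
```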


------------------------------------------------------------------------
Marielle Lange (PhD),  Psycholinguist

Alternative emails: mlange at blueyonder.co.uk, M.Lange at ed.ac.uk
Homepage                                http://homepages.lexicall.org/mlange/
Easy access to lexical databases        http://lexicall.org
Supporting Education Technologists      http://revolution.lexicall.org




