How trim: Bug in RegExp engine

Mark Greenberg markgreenberg at cox.net
Sun Oct 23 15:18:39 EDT 2005


On Oct 23, 2005, at 10:00 AM, Thomas Fischer wrote:

> * matches zero or more occurrences of the preceding character or  
> pattern
>
> I assumed that Revolution would do what it promised and didn't  
> check this.
>
> Try
> answer replaceText("A C","^ *","")
> I get "C", which obviously is not correct.
> If I remove the "*", I get "A C"

Though it's academic now that Bob has his solutions, this isn't a
Rev bug; it's the way Regular Expressions work (or fail to, in this
case).  The problem seems to lie in the greediness of the *
quantifier.  I can't say that I totally understand why, but when a
RegEx reduces to nothing once its optional parts are removed,
matching with either the ? or the * quantifier can give unexpected
results, whether the RegEx is in Perl, egrep, or wherever.  Such a
pattern can match nothingness, and a greedy quantifier takes as many
characters as it can and then backtracks, giving up one character at
a time.  Why "C" instead of "A C"?  I don't know, but my RegEx
reference book (Mastering Regular Expressions by Jeffrey E. F.
Friedl) does warn against such constructions as "^ *", with a lengthy
explanation of the greediness of the * and ? quantifiers.
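
For comparison, here is a minimal sketch of the same substitution in
Python's re module.  This is not Revolution's engine, and the helper
name trim_leading_spaces is made up for illustration; it only shows
what one backtracking engine of the Perl family does with a greedy
"^ *".

```python
import re

def trim_leading_spaces(text):
    # Greedy "^ *": anchor at the start, then take as many spaces as are there.
    # count=1 limits the call to the first match; without the MULTILINE flag,
    # "^" can only match at position 0 anyway.
    return re.sub(r"^ *", "", text, count=1)

# No leading spaces: " *" matches the empty string at position 0,
# so nothing is removed.
print(repr(trim_leading_spaces("A C")))

# Two leading spaces: the greedy quantifier takes both.
print(repr(trim_leading_spaces("  A C")))
```

In this engine the greedy form leaves "A C" alone, which suggests the
"C" result may come from how replaceText steps past an empty match.
That part is a guess.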

Evidently this has long been a known issue among those who use
RegEx, so most modern flavors of Regular Expressions include what
they call "lazy" quantifiers.  Whereas * has this problem, *? (the
lazy twin) does not.  Same with ? and ??.  So

answer replaceText("A C","^ *?","")

does work as expected.
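
To see what the lazy form actually matches, here is a small sketch,
again in Python's re module rather than in Rev.  The helper name
matched_prefix is hypothetical; it just reports the text a pattern
consumes at the start of a string.

```python
import re

def matched_prefix(pattern, text):
    # re.match tries the pattern only at position 0 and, on success,
    # group(0) is the text the pattern consumed there.
    m = re.match(pattern, text)
    return m.group(0) if m else None

text = "  A C"
greedy = matched_prefix(r"^ *", text)   # takes both leading spaces
lazy = matched_prefix(r"^ *?", text)    # settles for zero spaces

print(repr(greedy), repr(lazy))
```

In this engine, at least, the lazy form leaves "A C" unchanged because
it matches nothing at all, and for the same reason it would not strip
the leading spaces from "  A C" either.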

       Mark Greenberg

