How trim: Bug in RegExp engine
Mark Greenberg
markgreenberg at cox.net
Sun Oct 23 15:18:39 EDT 2005
On Oct 23, 2005, at 10:00 AM, Thomas Fischer wrote:
> * matches zero or more occurrences of the preceding character or
> pattern
>
> I assumed that Revolution would do what it promised and didn't
> check this.
>
> Try
> answer replaceText("A C","^ *","")
> I get "C", which obviously is not correct.
> If I remove the "*", I get "A C"
Though it's academic now that Bob has his solutions, this isn't a
Rev bug; it's the way Regular Expressions work (or fail to, in this
case). The problem is in the greediness of the * quantifier. When a
RegEx reduces to nothing once its optional parts are removed, as
"^ *" does when there are no leading spaces, matching with either
the ? or the * quantifier can cause unexpected results, whether the
RegEx is in Perl, egrep, or wherever. I can't say that I totally
understand why. As far as I can tell, the RegEx engine keeps trying
to find a match (to nothingness, I guess), consumes the entire
string, and then backtracks, giving up one character at a time. Why
"C" instead of "A C"? I don't know, but my RegEx reference book
(Mastering Regular Expressions by Jeffrey E. F. Friedl) does warn
against such constructions as "^ *", with a lengthy explanation of
the greediness of the * and ? quantifiers.
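
For comparison, here is a minimal sketch of what a greedy * does in
another engine. It uses Python's re module, which is an assumption
and not the engine behind Revolution's replaceText, so it only
illustrates greedy matching in general and is not expected to
reproduce the "C" result reported above. The helper name
strip_leading_spaces is hypothetical.

```python
import re

def strip_leading_spaces(text):
    # Greedy " *" takes every leading space it can, or none if there are none.
    # Without re.MULTILINE, "^" only matches at the very start of the string.
    return re.sub(r"^ *", "", text)

# The run of spaces a greedy * consumes at the start of the string.
greedy_match = re.match(r" *", "   A C").group(0)

# With no leading spaces, "^ *" can only match the empty string at the start.
no_spaces = strip_leading_spaces("A C")

# With leading spaces, the greedy * swallows all of them in one match.
with_spaces = strip_leading_spaces("   A C")

print(repr(greedy_match), repr(no_spaces), repr(with_spaces))
```

Engines differ in how they carry on after an empty match during a
global replace, so the same pattern can give different results
elsewhere.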
Evidently this has long been a known issue among those who use
RegEx, so most modern RegEx flavors include what they call "lazy"
quantifiers, which match as few characters as possible. Whereas *
has this problem, *? (its lazy twin) does not. The same goes for ?
and ??. So
answer replaceText("A C","^ *?","")
does work as expected.
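
The difference between the greedy and lazy twins can be sketched as
follows. This again assumes Python's re module rather than
Revolution's own engine, and the SAMPLE string is only an
illustration.

```python
import re

SAMPLE = "   A C"  # three leading spaces, then the text from the example

# Greedy quantifiers take as much as they can.
greedy_star = re.match(r" *", SAMPLE).group(0)
greedy_optional = re.match(r" ?", SAMPLE).group(0)

# Lazy quantifiers take as little as they can, which here is nothing.
lazy_star = re.match(r" *?", SAMPLE).group(0)
lazy_optional = re.match(r" ??", SAMPLE).group(0)

# A lazy quantifier only takes more when the rest of the pattern needs it.
lazy_forced = re.match(r" *?A", SAMPLE).group(0)

print(repr(greedy_star), repr(greedy_optional))
print(repr(lazy_star), repr(lazy_optional), repr(lazy_forced))
```

Note that in this sketch a lazy " *?" on its own matches the empty
string, so it would leave any leading spaces in place rather than
trim them.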
Mark Greenberg