How trim: Bug in RegExp engine

Jim Ault JimAultWins at yahoo.com
Sun Oct 23 16:45:41 EDT 2005


Hi Mark,

I am glad you are tossing this question into the mix.
My work is basically surface level application and is largely
trial-and-error using BBEdit to see what different expressions will yield on
a group of similar strings.

I thought that 'greediness' was to produce the longest possible match, and
the 'ungreedy (?U)' command was to have regex resolve to the shortest match.
For me, the use of symbols in sequence has remained somewhat of a mystery
since the grep engine works in more than one direction throughout a string.
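
To illustrate the greedy versus shortest-match distinction, here is a minimal
sketch using Python's re module. That is an assumption made for the sake of a
runnable example: it is not Revolution's engine or BBEdit's, and other flavors
may differ. Python has no (?U) switch, so the lazy form is written with the
*? quantifier. The sample string is made up.

```python
import re

# Hypothetical sample string with two tagged spans.
text = "<b>one</b> and <b>two</b>"

# Greedy: .* takes as much as it can, then backtracks to the LAST </b>.
greedy = re.search(r"<b>.*</b>", text).group(0)

# Lazy: .*? takes as little as it can, stopping at the FIRST </b>.
lazy = re.search(r"<b>.*?</b>", text).group(0)

print(greedy)  # <b>one</b> and <b>two</b>
print(lazy)    # <b>one</b>
```

Both patterns begin matching at the same position; only the amount consumed
by the quantifier differs.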

I would like to understand this particular issue a bit better, so I might do
a conditioned test where 'regex(flavor)|string>result' would simply be a
table of results based on common tasks.
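
A table like that could be sketched roughly as follows. This uses Python's re
module as a stand-in flavor (an assumption; Revolution's replaceText may give
different answers for the same rows, which is exactly what such a table would
expose). The helper name run_cases and the sample cases are hypothetical.

```python
import re

# (pattern, subject) pairs for common trimming tasks; sample data only.
cases = [
    (r"^ *", "A C"),     # leading spaces, greedy (may match nothing)
    (r"^ *?", "A C"),    # leading spaces, lazy
    (r"^ +", "  A C"),   # one or more leading spaces
    (r" +$", "A C  "),   # one or more trailing spaces
]

def run_cases(cases):
    """Delete every match of each pattern and collect (regex, string, result)."""
    rows = []
    for pattern, subject in cases:
        result = re.sub(pattern, "", subject)
        rows.append((pattern, subject, result))
    return rows

for pattern, subject, result in run_cases(cases):
    print(f"{pattern!r:10} | {subject!r:10} > {result!r}")
```

In this flavor all four rows come out as "A C", so the "^ *" case from the
thread does not reproduce here. Running the same rows through each engine of
interest would show where the flavors disagree.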

Perhaps we could compile a list of examples and alternatives.  The reason I
am interested is that I am using matchText, etc., to parse some web page
data, and there can always be unexpected results.  The nature of my project
means that I really want to avoid the unexpected.

Grepping html or other code is always more of a challenge than plain English
prose or database tables.

Jim Ault
Las Vegas


On 10/23/05 12:18 PM, "Mark Greenberg" <markgreenberg at cox.net> wrote:

> 
> On Oct 23, 2005, at 10:00 AM, Thomas Fischer wrote:
> 
>> * matches zero or more occurrences of the preceding character or
>> pattern
>> 
>> I assumed that Revolution would do what it promised and didn't
>> check this.
>> 
>> Try
>> answer replaceText("A C","^ *","")
>> I get "C", which obviously is not correct.
>> If I remove the "*", I get "A C"
> 
> Though it's academic now since Bob has his solutions, this isn't a
> Rev bug; it's the way Regular Expressions work (or fail to in this
> case).  The problem is in the greediness of the * quantifier.  Though
> I can't say that I totally understand why, in cases where the RegEx
> reduces to nothing after the optional parts are removed, matching
> with either the ? or the * quantifiers causes unexpected results,
> regardless of whether the RegEx is in Perl, egrep, or wherever.  This
> is because the RegEx engine continues to try to find a match (to
> nothingness, I guess), consumes the entire string, and then
> backtracks giving up one character at a time.  Why "C" instead of "A
> C"?  I don't know, but my RegEx reference book (Mastering Regular
> Expressions by Jeffrey E. F. Friedl) does warn against such
> constructions as "^ *" with a lengthy explanation about greediness of
> the * and ? quantifiers.
> 
> Evidently this has been a known issue for a long time to those who
> use RegEx, so most modern versions of Regular Expressions include
> what they call "lazy" quantifiers.  Whereas * has this problem, *?
> (the lazy twin) does not.  Same with ? and ??.  So
> 
> answer replaceText("A C","^ *?","")
> 
> does work as expected.
> 
>        Mark Greenberg
> _______________________________________________
> use-revolution mailing list
> use-revolution at lists.runrev.com
> Please visit this url to subscribe, unsubscribe and manage your subscription
> preferences:
> http://lists.runrev.com/mailman/listinfo/use-revolution




