How trim: Bug in RegExp engine

Marielle Lange mlange at lexicall.org
Tue Oct 25 06:06:08 EDT 2005


The concept of the "greediness" of the * quantifier has been
introduced. Let's expand. In practice, it means that when you parse
any HTML or XML file, you have to be very careful if you know the
same tag can occur more than once in your document.

Simple example:
The <b> cat</b> under the <b>table</b> is...

If you use:
put replacetext(tText, "<b>.*</b>", "")

This will give you:
The  is...
because * tries to match as many characters as possible, so the match
runs from the first <b> all the way to the last </b>.
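
The same greedy behaviour can be reproduced outside Revolution. Here is
a minimal sketch in Python, assuming its re module treats * the same
way as the PCRE engine discussed here (the variable names are only
illustrative):

```python
import re

text = "The <b> cat</b> under the <b>table</b> is..."

# Greedy: .* grabs as much as it can, so the match starts at the
# first <b> and only stops at the LAST </b> in the string.
greedy_match = re.search(r"<b>.*</b>", text).group(0)
print(greedy_match)   # <b> cat</b> under the <b>table</b>

# Replacing that single oversized match removes the text between the tags too.
greedy_result = re.sub(r"<b>.*</b>", "", text)
print(greedy_result)  # The  is...
```

Printing the match itself, and not just the replaced string, makes it
easier to see how far the * actually went.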

The way to handle this in PHP is to add a "?" after the "*", to
indicate explicitly that you want that "*" to be as ungreedy as
possible, i.e. to match as few characters as it can (see the end of
the passage quoted below):
http://uk.php.net/manual/en/reference.pcre.pattern.modifiers.php

> U (PCRE_UNGREEDY)
> This modifier inverts the "greediness" of the quantifiers so that  
> they are not greedy by default, but become greedy if followed by  
> "?". It is not compatible with Perl. It can also be set by a (?U)  
> modifier setting within the pattern or by a question mark behind a  
> quantifier (e.g. .*?).
>
So, let's try:
put replacetext(tText, "<b>.*?</b>","")

He he, this gives the correct result:
The  under the  is...
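
For comparison, a small Python sketch of the ungreedy version, again
assuming that Python's re module follows the same .*? convention as
PCRE (the names are illustrative, not part of Revolution):

```python
import re

text = "The <b> cat</b> under the <b>table</b> is..."

# Lazy: .*? stops at the FIRST </b> it can reach, so each tag pair
# is matched on its own.
lazy_matches = re.findall(r"<b>.*?</b>", text)
print(lazy_matches)  # ['<b> cat</b>', '<b>table</b>']

# Each pair is removed separately; the text between them survives.
lazy_result = re.sub(r"<b>.*?</b>", "", text)
print(lazy_result)   # The  under the  is...
```

Listing the individual matches shows that the pattern now finds two
short matches where the greedy one found a single long one.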

------------------------------------------------------------------------
Marielle Lange (PhD),  Psycholinguist

Alternative emails: mlange at blueyonder.co.uk, M.Lange at ed.ac.uk
Homepage                            http://homepages.lexicall.org/mlange/
Easy access to lexical databases    http://lexicall.org
Supporting Education Technologists  http://revolution.lexicall.org





