irregular expression

Jim Ault JimAultWins at yahoo.com
Sun Jul 16 02:06:07 EDT 2006


On 7/14/06 5:05 AM, "tvogelaar at de-mare.nl" <tvogelaar at de-mare.nl> wrote:

> Hi,
> 
> I have a problem with regular expressions. I want to use
> put replacetext(myVar,"</?a[^>]*>","") into myVar
> to delete all hyperlinks from an HTML file. I am totally convinced the regular
> expression is correct since it works correctly in another GREP-enabled
> application (BBEdit). Yet the line of code doesn't change a thing; all
> hyperlink tags are still there. How can this be?

Be careful when using RegEx that you consider ALL the rules you are
implementing, including the defaults.
Therein might lie the answer to your current mystery.
------
Point 1  When I use your expression, it DOES remove the <a> </a> tags from
an HTML page the same way as BBEdit.  Not sure why you don't get the same
result.
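
For anyone who wants to check the pattern outside Rev, here is a minimal sketch in Python rather than Transcript. Python's `re` module is not the PCRE library that Rev uses, so treat it only as a stand-in for this simple pattern. The function names and the sample text are made up for illustration. One thing it shows: the pattern is case-sensitive, so uppercase `<A HREF=...>` tags survive unless `(?i)` is added. Whether that is what happened in the original poster's file is only a guess.

```python
import re

def strip_anchor_tags(html: str) -> str:
    """Remove <a ...> and </a> tags using the pattern from the question."""
    return re.sub(r'</?a[^>]*>', '', html)

def strip_anchor_tags_any_case(html: str) -> str:
    """Same pattern with (?i) added so <A ...> and </A> are removed too."""
    return re.sub(r'(?i)</?a[^>]*>', '', html)

# Hypothetical sample: one lowercase tag split over two lines, one uppercase tag.
sample = ('See <a href="a.html"\n   title="A">this page</a>'
          ' or <A HREF="b.html">that one</A>.')

print(strip_anchor_tags(sample))
print(strip_anchor_tags_any_case(sample))
```

Because `[^>]` also matches a line break, the tag that is split over two lines is removed without needing `(?s)`. Be aware that the pattern also matches any other tag whose name starts with "a", such as `<abbr>`.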

---------
Point 2  The two will not always give the same result, since BBEdit has some
defaults and symbols that the PCRE lib in Rev does not...

PCRE option letters used below
U = ungreedy: quantifiers take the shortest possible match instead of the longest
s = dot-all: '.' also matches end of line (cr), so a match can run on to the end of the FILE
i = caseless: ignore case

    (?Usi)</?a.*>    ==  shortest match across lines, ignoring case
    (?Ui)</?a.*>    ==  shortest match in the same line, ignoring case
    (?U)</?a.*>    ==  shortest match in the same line, case-sensitive
    (?i)</?a.*>    ==  longest match in the same line, ignoring case
    (?si)</?a.*>    ==  longest match across lines, ignoring case
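
The effect of these option letters can be sketched in Python, with one important assumption: Python's `re` module is not PCRE and has no `(?U)` flag, so the lazy quantifier `.*?` is used below wherever the list above uses `(?U)`. The sample text and the helper name `find` are invented for illustration.

```python
import re

# Hypothetical sample: an uppercase tag pair on line 1,
# and a lowercase opening tag that is split across lines 2 and 3.
html = '<A href="x.html">one</A> and\n<a\nhref="y.html">two</a>'

def find(pattern: str) -> list:
    """Return every non-overlapping match of pattern in the sample."""
    return re.findall(pattern, html)

lazy_all_lines = find(r'(?si)</?a.*?>')   # stands in for (?Usi)</?a.*>
lazy_same_line = find(r'(?i)</?a.*?>')    # stands in for (?Ui)</?a.*>
lazy_case_sens = find(r'</?a.*?>')        # stands in for (?U)</?a.*>
greedy_same_line = find(r'(?i)</?a.*>')
greedy_all_lines = find(r'(?si)</?a.*>')

for name, result in [('lazy, all lines', lazy_all_lines),
                     ('lazy, same line', lazy_same_line),
                     ('lazy, case-sensitive', lazy_case_sens),
                     ('greedy, same line', greedy_same_line),
                     ('greedy, all lines', greedy_all_lines)]:
    print(name, '->', result)
```

Without `s`, the opening tag that is split across two lines is never matched, because `.` stops at the line break. With `s` and a greedy `.*`, the single match runs from the first `<` to the last `>` in the text.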

Invoking the shortest match on the same line
BBEdit   </?a.*>  (longest)      </?a.*?>  (shortest)
BBEdit                           (?U)</?a.*>  (shortest)
BBEdit                           </?a[^>]*>  (next > char)

Rev      </?a.*>  (longest)      (?U)</?a.*>  (shortest)
Rev                              </?a[^>]*>  (next > char)


Invoking the shortest match using ALL lines
BBEdit   (?s)</?a.*>  (longest)      (?s)</?a.*?>  (shortest)
Rev      (?s)</?a.*>  (longest)      (?Us)</?a.*>  (shortest)


To get equal results, you need to use the following in Rev

</?a.*>      BBEdit longest match on same line
</?a.*?>      BBEdit shortest match on same line
(?s)</?a.*>      BBEdit first '<' to last '>' across all lines

</?a.*>      Rev longest match on same line
(?U)</?a.*>      Rev shortest match on same line
(?s)</?a.*>      Rev first '<' to last '>' across all lines

Hope this helps.  Also see the info below.
Jim Ault
Las Vegas

----------------------
from the BBEdit docs    --- the '?' is a non-greedy quantifier, which is to
say, DO NOT find the longest match, find the SHORTEST

Non-Greedy Quantifiers

new in 6.5 - To work around this "longest match" behavior, you can modify
your pattern to take advantage of non-greedy quantifiers.

Pattern        Matches...
p+?            one or more p's
p*?            zero or more p's
p??            zero or one p's
p{COUNT}?      match exactly COUNT p's, where COUNT is an integer
p{MIN,}?       match at least MIN p's
p{MIN,MAX}?    match at least MIN p's, but no more than MAX

Astute readers will note that these non-greedy quantifiers correspond exactly
to their normal (greedy) counterparts, appended with a question mark.
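
To make the table concrete, here is a small sketch that applies each quantifier to the made-up string `pppp`, using Python's `re` module as a stand-in for BBEdit's grep (an assumption; the two engines agree on these basic quantifiers as far as this example goes). The helper name `first_match` is hypothetical.

```python
import re

text = 'pppp'

def first_match(pattern: str) -> str:
    """Return what pattern matches at the start of text."""
    return re.match(pattern, text).group()

# Non-greedy forms take as few p's as the quantifier allows.
lazy = {p: first_match(p) for p in
        ['p+?', 'p*?', 'p??', 'p{2}?', 'p{2,}?', 'p{2,3}?']}

# Their greedy counterparts take as many as they can.
greedy = {p: first_match(p) for p in
          ['p+', 'p*', 'p?', 'p{2}', 'p{2,}', 'p{2,3}']}

print(lazy)
print(greedy)
```

On their own the non-greedy forms stop at the minimum count. Inside a larger pattern they grow only as far as needed for the rest of the pattern to match.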

Revisiting our problem of matching HTML tags, for example, we can search for:

    <.+?>
This matches an opening bracket, then one or more occurrences of any
character other than a return, followed by a closing bracket. The non-greedy
quantifier achieves the results we want, preventing BBEdit from "overrunning"
the closing angle bracket and matching across several tags.

A slightly more complicated example: how could you write a pattern that
matches all text between <B> and </B> HTML tags? Consider the sample text
below:

    <B>Welcome</B> to the home of <B>BBEdit!</B>
As before, you might be tempted to write:

    <B>.*</B>
but for the same reasons as before, this will match the entire line of text.
The solution is similar. We'll use the non-greedy *? quantifier:

    <B>.*?</B>
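
The difference between the two patterns can be checked on the sample line from the BBEdit docs. This is a sketch in Python's `re` module, not in BBEdit or Rev. The capture group in the last pattern is an addition for illustration and is not part of the quoted docs.

```python
import re

line = '<B>Welcome</B> to the home of <B>BBEdit!</B>'

# Greedy: .* runs to the last </B> on the line, so there is one long match.
greedy = re.findall(r'<B>.*</B>', line)

# Non-greedy: .*? stops at the first </B> it can, so each pair matches separately.
lazy = re.findall(r'<B>.*?</B>', line)

# Assumed extension: parentheses return only the text between the tags.
inner = re.findall(r'<B>(.*?)</B>', line)

print(greedy)
print(lazy)
print(inner)
```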





More information about the use-livecode mailing list