HTML Tags and multiline regular expressions.

David Bovill david at openpartnership.net
Wed Aug 9 14:49:49 EDT 2006


OK - here is what I have got so far.

First, I gave up on the multiline thing. For now I just replace all
lineFeeds with empty, but I would still like to know how to do this
properly in the longer term. This is my function:

function html_ExtractTagContents tagName, someHtml
    -- get the first one only
    -- using white space char "\s*" all over the place

    local tagContents -- not sure if it is still required

    put "<\s*" & tagName & "\s+name=[^>]*>(.*)<\s*/\s*" & tagName & "\s*>" into someReg
    -- put "(?m)" before someReg -- does not seem to have an effect
    replace lineFeed with empty in someHtml -- seems necessary

    if matchText(someHtml, someReg, tagContents) is false then
        return empty
    else
        return tagContents
    end if
end html_ExtractTagContents

Any improvements - especially how to do the multiline thing properly?
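
One possible answer to the multiline question, offered as a sketch rather than a tested fix. According to the PCRE notes quoted further down, the m option only changes where ^ and $ match, which would explain why "(?m)" had no visible effect. It is the s option (PCRE_DOTALL) that lets "." match a newline. Assuming matchText passes inline options through to PCRE, prefixing the pattern with "(?s)" should remove the need to strip the lineFeeds. The handler name html_ExtractFirstTag is made up for this example, and the lazy "(.*?)" is a substitution of mine so that the match stops at the first closing tag.

```livecode
function html_ExtractFirstTag tagName, someHtml
    -- hypothetical variant of html_ExtractTagContents
    local tagContents, someReg

    -- (?s)  = PCRE_DOTALL: "." also matches newline characters
    -- (.*?) = lazy match: stop at the first closing tag, not the last
    put "(?s)<\s*" & tagName & "\s+name=[^>]*>(.*?)<\s*/\s*" & tagName & "\s*>" into someReg

    if matchText(someHtml, someReg, tagContents) then
        return tagContents
    else
        return empty
    end if
end html_ExtractFirstTag
```

Called with a tag whose contents span two lines, this should return both lines with the lineFeed between them intact, since someHtml is no longer modified before matching.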



For reference, the following extracts were taken from the PCRE man text at:
http://www.pcre.org/man.txt

Some RegExp Info

There are two different sets of metacharacters: those that are
recognized anywhere in the pattern except within square brackets, and
those that are recognized in square brackets. Outside square brackets,
the metacharacters are as follows:

  \      general escape character with several uses
  ^      assert start of string (or line, in multiline mode)
  $      assert end of string (or line, in multiline mode)
  .      match any character except newline (by default)
  [      start character class definition
  |      start of alternative branch
  (      start subpattern
  )      end subpattern
  ?      extends the meaning of (
         also 0 or 1 quantifier
         also quantifier minimizer
  *      0 or more quantifier
  +      1 or more quantifier
         also "possessive quantifier"
  {      start min/max quantifier

Part of a pattern that is in square brackets is called a "character
class". In a character class the only metacharacters are:

  \      general escape character
  ^      negate the class, but only if the first character
  -      indicates character range
  [      POSIX character class (only if followed by POSIX
         syntax)
  ]      terminates the character class


Generic character types

\d any decimal digit
\D any character that is not a decimal digit
\s any whitespace character
\S any character that is not a whitespace character
\w any "word" character
\W any "non-word" character

Non-printing characters

\a alarm, that is, the BEL character (hex 07)
\cx "control-x", where x is any character
\e escape (hex 1B)
\f formfeed (hex 0C)
\n newline (hex 0A)
\r carriage return (hex 0D)
\t tab (hex 09)
\ddd character with octal code ddd, or backreference
\xhh character with hex code hh
\x{hhh..} character with hex code hhh..

The backslashed assertions are:

\b matches at a word boundary
\B matches when not at a word boundary
\A matches at start of subject
\Z matches at end of subject or before newline at end
\z matches at end of subject
\G matches at first matching position in subject
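
As a small illustration of the \b assertion, here is a sketch that looks for a whole word and reports where it was found. The handler name regex_FindWholeWord is invented for this example, and it assumes someWord contains no regex metacharacters. It uses matchChunk, which fills in the start and end character positions of the parenthesized group.

```livecode
function regex_FindWholeWord someWord, someText
    -- hypothetical helper, not part of the original function
    local startChar, endChar

    -- \b on both sides: only match someWord when it stands alone
    if matchChunk(someText, "(\b" & someWord & "\b)", startChar, endChar) then
        return startChar & comma & endChar
    else
        return empty
    end if
end regex_FindWholeWord
```

For example, regex_FindWholeWord("cat", "category cat") should skip the "cat" inside "category", because there is no word boundary between "t" and "e", and report the standalone word at the end of the string instead.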

INTERNAL OPTION SETTING

The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
PCRE_EXTENDED options can be changed from within the pattern by a
sequence of Perl option letters enclosed between "(?" and ")". The
option letters are

  i  for PCRE_CASELESS
  m  for PCRE_MULTILINE
  s  for PCRE_DOTALL
  x  for PCRE_EXTENDED

For example, (?im) sets caseless, multiline matching. It is also
possible to unset these options by preceding the letter with a hyphen,
and a combined setting and unsetting such as (?im-sx), which sets
PCRE_CASELESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and
PCRE_EXTENDED, is also permitted. If a letter appears both before and
after the hyphen, the option is unset.
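
To see the option letters at work from a script, here is a minimal sketch that runs the same pattern with and without "(?im)". It assumes matchText hands the inline options straight to PCRE; the handler name regex_OptionDemo and the sample text are made up for the example.

```livecode
function regex_OptionDemo
    -- hypothetical demo handler
    local someText, plainResult, optionResult

    put "Hello" & lineFeed & "WORLD" into someText

    -- no options: ^ and $ anchor to the whole string and case matters,
    -- so this should not match
    put matchText(someText, "^world$") into plainResult

    -- (?im): caseless, and ^ / $ also match at line starts and ends,
    -- so the second line should match
    put matchText(someText, "(?im)^world$") into optionResult

    return plainResult & comma & optionResult
end regex_OptionDemo
```

If the options behave as the extract describes, this returns "false,true".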


