RegEx Help--Across Lines

Jim Ault JimAultWins at yahoo.com
Wed May 4 21:55:49 EDT 2005


Just reviewed old posts and found this one about RegEx parsing HTML 
and cr (return) characters.  I wanted to pass along a little trick I 
found useful to extract tables into tab delimited format.

Premise:  An HTML document is formatted with spaces and cr's for the 
benefit of the programmer.  Basically the browser app ignores this 
white space, so an HTML table will display correctly even when extra 
characters, such as multiple cr's, are sprinkled about.

The (?s) is good for searching past cr's, but it can make a 
difference if you wish to end up with a single cr defining a table 
row, rather than 2 or 3 or 4 cr's.  Also, this is cr-specific.

One of my first steps is to replace cr with string "MMMM".  This 
makes the entire block of HTML text a single line, so there is no 
need for (?s).  It also makes spurious cr's easily identifiable by 
subsequent search commands, not to mention easily visible when 
checking your results.  BBEdit in softwrap mode allows you to see all 
of the text even without the returns.
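
A minimal sketch of that first step, for anyone who wants to try it. 
The handler and the variable tHtml are made up for the example; only 
the replace command itself comes from the description above.

```livecode
on mouseUp
   local tHtml
   -- hypothetical sample: a small table with a spurious extra cr
   put "<table>" & cr & "<tr><td>a</td></tr>" & cr & cr & "</table>" into tHtml
   -- every cr becomes the marker string, so a double cr shows up as MMMMMMMM
   replace cr with "MMMM" in tHtml
   put tHtml  -- the whole block is now one line
end mouseUp
```

Any marker works as long as it cannot occur in the page itself; 
"MMMM" is just easy to spot.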

Of course, you could simply replace cr with "" in htmlTextBlock and 
there is no need for (?s) either.  The browser will display the same 
page, with or without the returns present (use a space instead of "" 
if a cr might be the only thing separating two words).

As I mine data from HTML I find it useful to re-establish cr's at 
specific points, thus the MMMM replacement allows me to reinsert cr's 
where desired and use loops that "repeat for the number of lines" for 
patterned data blocks.
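
Here is a rough sketch of what re-establishing cr's at specific 
points can look like when the goal is tab delimited output.  The 
sample HTML, the handler and the variable names (tHtml, tOut, 
tCount, i) are invented for illustration, and a real page will need 
more cleanup than this.

```livecode
on mouseUp
   local tHtml, tOut, tCount, i
   -- hypothetical input, already flattened to one line with MMMM markers
   put "<tr><td>a</td><td>b</td></tr>MMMM<tr><td>c</td><td>d</td></tr>MMMM" into tHtml
   replace "MMMM" with empty in tHtml   -- discard the source formatting
   replace "</tr>" with cr in tHtml     -- one cr per table row
   replace "</td>" with tab in tHtml    -- one tab per cell
   -- strip whatever tags are left
   put replaceText(tHtml, "<[^>]+>", empty) into tHtml
   -- now a "repeat for the number of lines" loop sees one row per line
   put the number of lines of tHtml into tCount
   repeat with i = 1 to tCount
      put "row" && i & ":" && line i of tHtml & cr after tOut
   end repeat
   put tOut
end mouseUp
```

Each row still ends with a trailing tab in this sketch; trim it if 
the receiving application cares.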


Further, the pattern   MMM[M]+   will locate every run of MMMM or 
longer, no matter how many cr's were in a row,
using
get matchChunk(temp,"(MMM[M]+)", startChar, endChar)  ==> matches 4 M's up to however many

-------------thus------
    put fld "htmlTextToParse" into temp
    replace cr with "MMMM" in temp  -- the first step described above

    -- note: the startChar and endChar vars do not have to be defined
    -- before matchChunk, but you do have to use parens in the regex
    -- string so that matchChunk has a substring to report positions for
    -- looping on the function's result stops cleanly when no run is left
    repeat while matchChunk(temp,"(MMM[M]+)", startChar, endChar)
       put return into char startChar to endChar of temp
       -- for demo purposes only: log each replaced range
       put return & startChar & "," & endChar after temp
    end repeat
    put return & "startChar, endChar list " after temp  -- for demo purposes only
    put temp  -- view the replacement, and the char list at the bottom

  -------------------will convert a run of cr's to a single cr.
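
If you do not need the character positions, the same collapse can 
probably be done in one call with replaceText.  This is only a sketch 
under that assumption, with a made-up handler and sample string:

```livecode
on mouseUp
   local temp
   -- hypothetical sample: three cr's in a row, then a single cr
   put "row1" & cr & cr & cr & "row2" & cr & "row3" into temp
   replace cr with "MMMM" in temp
   -- every run of four or more M's becomes exactly one return
   put replaceText(temp, "MMM[M]+", return) into temp
   put temp  -- row1, row2 and row3, each on its own line
end mouseUp
```

The matchChunk loop is still the better fit when you want to inspect 
or log where each run was found.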

Nested tables can be problematic, but I find that this technique 
allows me to establish my true output cr's in the cacophony of HTML 
source code formatting.

Hope this helps those who need to learn a bit more about the power of 
RegEx and Rev.

Jim Ault
Las Vegas


>On 11/20/04 8:20 PM, "Sivakatirswami" <katir at hindu.org> wrote:
>
>>  I am using Rev to repurpose old html to new CSS compliant mark up. The
>>  old pages are incredibly inconsistent.  Fortunately grep is our
>>  friend. I need a grep expression that will parse out the content from
>>  both #1:
>>
>>  <title> some title </title>
>>
>>  and #2
>>
>>  <title> some title
>>  </title>
>>
>>  where the first instance has no line break but the second one does
>
>Use the "(?s)" directive:
>
>on mouseUp
>   local tTitle, tXML
>   put "<title>some title"&cr&"</title>" into tXML
>   get matchText(tXML,"(?s)title>(.*?)</title",tTitle)
>   put tTitle
>end mouseUp
>
>Note that you'll get the trailing CR after "some title" as well, so you'd
>have to strip that out if you want to.
>
>Check the docs at http://www.pcre.org/man.txt - the "?s" directive
>corresponds to PCRE_DOTALL, which causes the "." character to match all
>characters, including newlines (CRs).
>
>HTH,
>
>Ken Ray
>Sons of Thunder Software
>Web site: http://www.sonsothunder.com/
>Email: kray at sonsothunder.com
>
>


