RegEx Help--Across Lines
Jim Ault
JimAultWins at yahoo.com
Wed May 4 21:55:49 EDT 2005
Just reviewed old posts and found this one about RegEx parsing HTML
and cr (return) characters. I wanted to pass along a little trick I
found useful to extract tables into tab delimited format.
Premise: An HTML document is formatted with spaces and cr's for the
benefit of the programmer. The browser ignores this white space, so
an HTML table will display correctly even when extra characters,
such as multiple cr's, are sprinkled about.
The (?s) directive is good for searching past cr's, but it matters
if you wish to end up with a single cr defining a table row, rather
than 2 or 3 or 4 cr's. Also, this trick is specific to cr's.
One of my first steps is to replace cr with the string "MMMM". This
makes the entire block of HTML text a single line, so there is no
need for (?s). It also makes spurious cr's easily identifiable by
subsequent search commands, not to mention easily visible when
checking your results. BBEdit in softwrap mode allows you to see all
of the text even without the returns.
Of course, you could simply replace cr with "" in htmlTextBlock and
there is no need for (?s) either. The browser will display the same
page, with or without the returns present.
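A minimal sketch of that first step. The sample string is made up for
illustration; htmlTextBlock is the variable name used above:

```livecode
on mouseUp
   -- made-up sample: a table row broken across several lines
   put "<tr>" & cr & "<td>a</td>" & cr & cr & "</tr>" into htmlTextBlock
   -- swap every cr for a visible marker; the block becomes one line
   replace cr with "MMMM" in htmlTextBlock
   put htmlTextBlock
   -- shows: <tr>MMMM<td>a</td>MMMMMMMM</tr>
end mouseUp
```

The double cr shows up as a run of eight M's, which is what makes
spurious returns easy to spot.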
As I mine data from HTML I find it useful to re-establish cr's at
specific points, thus the MMMM replacement allows me to reinsert cr's
where desired and use loops that "repeat for the number of lines" for
patterned data blocks.
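One way this could look for a simple table, as a rough sketch only.
The function name tableToTabbed is hypothetical, and the choice of
"</tr>" and "</td>" as the points where a cr and a tab go back in is
my assumption about a plain, un-nested table:

```livecode
function tableToTabbed pHtml
   -- drop the formatting returns; the browser ignores them anyway
   replace cr with empty in pHtml
   -- assumed convention: one output line per row, one tab per cell
   replace "</tr>" with cr in pHtml
   replace "</td>" with tab in pHtml
   -- strip whatever tags are left
   return replaceText(pHtml, "<[^>]*>", empty)
end tableToTabbed
```

Each output line ends with a trailing tab, since every cell close
becomes a tab. The result can then be walked with a "repeat for each
line" loop.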
Further, MMM[M]+ will locate every run of MMMM or longer, no matter
how many cr's were in a row, using
get matchChunk(temp,"(MMM[M]+)", startChar, endChar) ==> matches 4 M's up to however many
-------------thus------
put fld "htmlTextToParse" into temp
replace cr with "MMMM" in temp -- the first step described above
put empty into charList
-- note: the startChar and endChar vars do not have to be
-- defined before matchChunk
-- the loop ends when matchChunk finds no more runs of M's
repeat while matchChunk(temp,"(MMM[M]+)", startChar, endChar)
  -->you have to use parens in the regex string
  put return into char startChar to endChar of temp
  put return & startChar & "," & endChar after charList -->for demo purposes only
end repeat
put return & "startChar, endChar list" & charList after temp -->for demo purposes only
put temp --> view the replacement, and the char list at the bottom
-------------------will convert a run of cr's to a single cr.
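If the character positions are not needed, the same collapse can
probably be done in one call with replaceText, which takes a regular
expression. A sketch, assuming the same field name as in the loop:

```livecode
on mouseUp
   put fld "htmlTextToParse" into temp
   replace cr with "MMMM" in temp
   -- every run of four or more M's becomes a single return
   put replaceText(temp, "MMM[M]+", return) into temp
   put temp
end mouseUp
```

Note that any run of four or more M's already present in the page
text would be replaced as well.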
Nested tables can be problematic, but I find that this technique
allows me to establish my true output cr's in the cacophony of HTML
source code formatting.
Hope this helps those who need to learn a bit more about the power of
RegEx and Rev
Jim Ault
Las Vegas
>On 11/20/04 8:20 PM, "Sivakatirswami" <katir at hindu.org> wrote:
>
>> I am using Rev to repurpose old html to new CSS compliant mark up. The
>> old pages are incredibly inconsistent. Fortunately grep is our
>> friend.. I need a grep expression that will pass out the content from
>>
>> both #1:
>>
>> <title> some title </title>
>>
>> and #2
>>
>> <title> some title
>> </title>
>>
>> where the first instance has no line break but the second one does
>
>Use the "(?s)" directive:
>
>on mouseUp
> local tTitle
> put "<title>some title"&cr&"</title>" into tXML
> get matchText(tXML,"(?s)title>(.*?)</title",tTitle)
> put tTitle
>end mouseUp
>
>Note that you'll get the trailing CR after "some title" as well, so you'd
>have to strip that out if you want to.
>
>Check the docs at http://www.prce.org/man.txt - the "?s" directive
>corresponds to PCRE_DOTALL, which causes the "." character to match all
>characters, including newlines (CRs).
>
>HTH,
>
>Ken Ray
>Sons of Thunder Software
>Web site: http://www.sonsothunder.com/
>Email: kray at sonsothunder.com