MatchWithin
Jim Ault
JimAultWins at yahoo.com
Sun Nov 26 16:20:33 EST 2006
On 11/26/06 10:55 AM, "Mark Wieder" <mwieder at ahsoftware.net> wrote:
> Yesterday I had a need for a utility function that returns the text
> between two HTML tags, so I cobbled this together, but it will work
> for any text, not just tags. If you have, for example,
>
> pText = "<a href='x.html'>click here<h3>This is a title</h3></a>"
>
> you can call
>
> put MatchWithin(pText, "<h3>", "</h3>")
>
> and get "This is a title"
>
> Any comments? Can it be made easier? Faster? Is there a better way to
> do this?
>
> FUNCTION MatchWithin pRawText, pStartText, pEndText
> local tReturn
>
> IF matchText(pRawText, "((?Uis)" & pStartText & "(.+)" & pEndText & ")", tReturn) THEN
> delete char -length(pEndText) to -1 of tReturn
> delete char 1 to length(pStartText) of tReturn
> END IF
> return tReturn
> END MatchWithin
I use the matchChunk function in Rev in different ways.
(Note: the regEx is changed from your example. The outer parentheses are
gone, so the only captured group is the text between the delimiters.)
> put MatchWithin(pText, "<h3>", "</h3>")
put MatchWithinFIRST(pText, "<h3>", "</h3>") into msg
--only returns the first match, if any
--no error checking done, see below
put MatchWithinALL(pText, "<h3>", "</h3>") into msg
-- returns all matches concatenated with CR&CR
----------------------------------
FUNCTION MatchWithinFIRST pRawText, pStartText, pEndText
put "(?Uis)" & pStartText & "(.+)" & pEndText into regEx
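local ch1, ch2
-- matchChunk puts the start and end positions of the (.+) group into ch1 and ch2
if matchChunk(pRawText, regEx, ch1, ch2) then return char ch1 to ch2 of pRawText
return empty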
end MatchWithinFirst
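The reason for dropping the outer parentheses: matchChunk reports character
positions for each parenthesized group, so with (.+) as the only group the two
position variables bracket just the text between the delimiters. A minimal
sketch to try in a button script (the mouseUp handler and the t-prefixed
variable names are only for illustration, not part of the functions above):

```livecode
on mouseUp
   local tText, tStart, tEnd
   put "<a href='x.html'>click here<h3>This is a title</h3></a>" into tText
   -- one capture group in the pattern, so one pair of position variables
   if matchChunk(tText, "(?Uis)<h3>(.+)</h3>", tStart, tEnd) then
      -- tStart and tEnd are the first and last characters of the captured text
      answer char tStart to tEnd of tText
   else
      answer "no match"
   end if
end mouseUp
```

This should show the title text without the surrounding tags.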
-------------------------------
FUNCTION MatchWithinALL pRawText, pStartText, pEndText
put "(?Uis)" & pStartText & "(.+)" & pEndText into regEx
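local ch1, ch2, extractedText
-- loop on matchChunk's own result, so the loop ends as soon as nothing matches
repeat while matchChunk(pRawText, regEx, ch1, ch2)
put char ch1 to ch2 of pRawText &cr&cr after extractedText
-- remove the whole match, delimiters included (assumes pStartText and
-- pEndText are literal strings), so the emptied pair is not matched again
delete char (ch1 - length(pStartText)) to (ch2 + length(pEndText)) of pRawText
end repeat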
delete char -2 to -1 of extractedText
--
if extractedText contains pStartText then
answer "Ooops, there is at least one extra "&cr & pStartText &cr &"string"
end if
--
if pRawText contains pEndText then
answer "Ooops, there is at least one extra "&cr & pEndText &cr &"string"
end if
--
return extractedText
end MatchWithinAll
Depending on the source, an extraction function for HTML pages should
usually also look for unmatched tags and for tags that appear between comment
tags (and so are not part of the visible page).
Unmatched tags can be detected by:
[1] if the start tag is in any of the extracted segments
[2] if the end tag is in the 'pRawText' after extraction
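For the commented-out case, one option is to strip the comments before
extracting. A rough sketch using replaceText; the function name
StripHtmlComments is made up for this example, and it assumes every comment is
well formed (each <!-- has its closing -->):

```livecode
FUNCTION StripHtmlComments pRawText
   -- (?s) lets . match line breaks; .*? stops at the first closing -->
   return replaceText(pRawText, "(?s)<!--.*?-->", empty)
end StripHtmlComments

-- possible use, with pText holding the page source:
-- put MatchWithinALL(StripHtmlComments(pText), "<h3>", "</h3>") into msg
```

An unterminated comment would be left in place by this pattern, so it does not
replace the unmatched-tag checks.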
Hope this adds to the thread
Jim Ault
Las Vegas