MatchWithin

Jim Ault JimAultWins at yahoo.com
Sun Nov 26 16:20:33 EST 2006


On 11/26/06 10:55 AM, "Mark Wieder" <mwieder at ahsoftware.net> wrote:
> Yesterday I had a need for a utility function that returns the text
> between two HTML tags, so I cobbled this together, but it will work
> for any text, not just tags. If you have, for example,
> 
> pText = "<a href='x.html'>click here<h3>This is a title</h3></a>"
> 
> you can call
> 
> put MatchWithin(pText, "<h3>", "</h3>")
> 
> and get "This is a title"
> 
> Any comments? Can it be made easier? Faster? Is there a better way to
> do this?
> 
> FUNCTION MatchWithin pRawText, pStartText, pEndText
>     local tReturn
>     
>     IF matchText(pRawText, "((?Uis)" & pStartText & "(.+)" & pEndText & ")", \
>             tReturn) THEN
>         delete char -length(pEndText) to -1 of tReturn
>         delete char 1 to length(pStartText) of tReturn
>     END IF
>     return tReturn
> END MatchWithin

I use the matchChunk function in Rev in different ways.
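(Note the change in the regEx from your example: the outer parentheses are
gone, so only the text between the two tags is captured and nothing has to
be trimmed afterwards.)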

> put MatchWithin(pText, "<h3>", "</h3>")

 put MatchWithinFIRST(pText, "<h3>", "</h3>") into msg
--only returns the first match, if any
--no error checking done, see below

 put MatchWithinALL(pText, "<h3>", "</h3>") into msg
-- returns all matches concatenated with CR&CR


----------------------------------
FUNCTION MatchWithinFIRST pRawText, pStartText, pEndText
  put "(?Uis)" & pStartText & "(.+)" & pEndText into regEx
  get matchChunk( pRawText, regEx, ch1, ch2)
  return char ch1 to ch2 of pRawText
end MatchWithinFirst
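
For comparison, here is a rough sketch of the same function with the result
of matchChunk checked, so that a failed match returns empty instead of
whatever ch1 and ch2 happen to hold. The name MatchWithinFirstSafe is made
up for this example.

----------------------------------
FUNCTION MatchWithinFirstSafe pRawText, pStartText, pEndText
  -- hypothetical variant, not one of the handlers above
  put "(?Uis)" & pStartText & "(.+)" & pEndText into regEx
  -- matchChunk returns true only when the pattern was found
  if matchChunk( pRawText, regEx, ch1, ch2) then
    return char ch1 to ch2 of pRawText
  end if
  return empty
end MatchWithinFirstSafe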

-------------------------------
FUNCTION MatchWithinALL pRawText, pStartText, pEndText
  put "(?Uis)" & pStartText & "(.+)" & pEndText into regEx
  put empty into extractedText
  -- matchChunk returns true while a match remains;
  -- ch1 and ch2 get the start and end of the captured text
  repeat while matchChunk( pRawText, regEx, ch1, ch2)
    put char ch1 to ch2 of pRawText &cr&cr after extractedText
    -- remove the whole match, tags included, so the next pass
    -- cannot pair a leftover start tag with a later end tag
    -- (assumes pStartText and pEndText are literal text, not patterns)
    delete char (ch1 - length(pStartText)) to (ch2 + length(pEndText)) of pRawText
  end repeat
  delete char -2 to -1 of extractedText
  --
  if extractedText contains pStartText then
    answer "Ooops, there is at least one extra "&cr & pStartText &cr &"string"
  end if
  --
  if pRawText contains pEndText then
    answer "Ooops, there is at least one extra "&cr & pEndText &cr &"string"
  end if
  --
  return extractedText
end MatchWithinAll

Depending on the source, an extraction function for most HTML pages should
also look for unmatched tags and for tags that appear between comment tags
(and so are not part of the visible page).
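
One way to deal with the comment case is to strip the comments before
extracting. This is only a sketch: it assumes the comments are well formed
(every <!-- has its -->), and the name StripHtmlComments is made up for this
example.

----------------------------------
FUNCTION StripHtmlComments pRawText
  -- hypothetical helper, not one of the handlers above
  -- (?Us) makes .* stop at the first --> and lets it span line breaks
  return replaceText( pRawText, "(?Us)<!--.*-->", empty)
end StripHtmlComments

 put MatchWithinALL(StripHtmlComments(pText), "<h3>", "</h3>") into msg
-- tags inside <!-- ... --> are gone before the extraction runs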


Unmatched tags can be detected by:

[1] if the start tag is in any of the extracted segments
[2] if the end tag is in the 'pRawText' after extraction

Hope this adds to the thread

Jim Ault
Las Vegas






