Regular expressions and html: html_ExtractAllLinks
David Bovill
david at openpartnership.net
Wed Aug 9 15:48:38 EDT 2006
An experiment in the world of regular expressions. This pattern got 5 stars
and is simple. It has not been tested much, so see if it works for you.
function html_ExtractAllLinks someHtml
   /*
   based on http://regexlib.com/REDetails.aspx?regexp_id=774
   Pattern: href[\s]*=[\s]*"[^\n"]*"
   RegExp Author: Tony Hawe
   Matching Text: href
   ="http://www.theregister.com/"|||href="http://theregister.co.uk"|||hre
   Non-Matching Text: href=http://theregister.co.uk
   Description: A very short pattern for extracting hrefs from HTML,
   does not validate they are within a tag
   */
   local urlIndex, someReg, someUrl, startCharNum, endCharNum
   replace lineFeed with empty in someHtml -- seems necessary
   -- parentheses added around the URL so matchChunk reports its position
   put "href[\s]*=[\s]*'([^\n']*)'" into someReg
   replace "'" with quote in someReg -- for now to make regExp readable
   repeat
      if matchChunk(someHtml, someReg, startCharNum, endCharNum) is false then
         delete last char of urlIndex -- drop the trailing CR
         return urlIndex
      else
         put char startCharNum to endCharNum of someHtml into someUrl
         put someUrl & CR after urlIndex
         -- discard everything up to and including the closing quote
         delete char 1 to endCharNum + 1 of someHtml
      end if
   end repeat
end html_ExtractAllLinks
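
A quick way to try the function is sketched below. It is only an illustration:
the mouseUp handler, the variable names tHtml and tLinks, and the example.com
URLs are all made up for the test and are not part of the function above. Note
that the pattern only picks up double-quoted, lower-case href attributes, so
href='...' or HREF="..." would be skipped.

```livecode
on mouseUp
   local tHtml, tLinks
   -- hypothetical sample input; the URLs are placeholders
   put "<a href=" & quote & "http://example.com/a" & quote & ">A</a>" & return & \
         "<a href = " & quote & "http://example.com/b" & quote & ">B</a>" into tHtml
   put html_ExtractAllLinks(tHtml) into tLinks
   -- tLinks should now hold one URL per line
   answer tLinks
end mouseUp
```

With the sample above, the answer dialog should list the two URLs, one per
line, without the surrounding quotes.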