Regular expressions and html: html_ExtractAllLinks

David Bovill david at openpartnership.net
Wed Aug 9 15:48:38 EDT 2006


An experiment with regular expressions - this pattern got 5 stars and is
simple. I have not tested it much, so see if it works for you.

function html_ExtractAllLinks someHtml
    /*
    based on http://regexlib.com/REDetails.aspx?regexp_id=774

    Pattern:  href[\s]*=[\s]*"[^\n"]*"
    RegExp Author:  Tony Hawe
    Matching Text:    href ="http://www.theregister.com/"|||href="http://theregister.co.uk"|||hre
    Non-Matching Text:    href=http://theregister.co.uk
    Description:    A very short pattern for extracting hrefs from HTML, does not validate they are within a tag
    */

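    local urlIndex, someReg, someUrl, startCharNum, endCharNum
    replace lineFeed with empty in someHtml -- seems necessary

    -- single quotes stand in for double quotes to keep the regExp readable;
    -- the brackets capture just the url between the quotes
    put "href[\s]*=[\s]*'([^\n']*)'" into someReg
    replace "'" with quote in someReg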

    repeat
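        -- matchChunk puts the char positions of the captured url into
        -- startCharNum and endCharNum
        if matchChunk(someHtml, someReg, startCharNum, endCharNum) is false then
            delete last char of urlIndex -- strip the trailing CR
            return urlIndex
        else
            put char startCharNum to endCharNum of someHtml into someUrl
            put someUrl & CR after urlIndex
            -- drop everything up to and including the closing quote,
            -- then search the rest
            delete char 1 to endCharNum + 1 of someHtml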
        end if
    end repeat
end html_ExtractAllLinks
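
A quick way to try it out, as a minimal sketch only: the handler below
assumes the function above is somewhere in the message path (for example in
the same button script). The sampleHtml variable and the example.com
addresses are made up for illustration and are not from the regexlib entry.

on mouseUp
    local sampleHtml
    -- two links, the second with spaces around the "=" sign
    put "<a href=" & quote & "http://example.com/one" & quote & ">one</a>" & CR into sampleHtml
    put "<a href = " & quote & "http://example.com/two" & quote & ">two</a>" after sampleHtml
    -- with no destination, put shows the result in the message box
    put html_ExtractAllLinks(sampleHtml)
end mouseUp

This should list the two urls, one per line. An unquoted link such as
href=http://example.com/three would be skipped, as in the Non-Matching Text
example in the script comment.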


