Getting the text content of a HTML page

Jim Ault JimAultWins at yahoo.com
Mon Aug 4 14:59:05 EDT 2008


Richard, a couple additional notes:

On 8/4/08 11:25 AM, "Richard Gaskin" <ambassador at fourthworld.com> wrote:

> I put that into this function:
> 
> function RegexMethod pHtml
>    put "" into newString
>    put "(?U)<.*>" into regEx
>    return replaceText(pHtml,regEx,newString)
> end RegexMethod
> 
> ...and then ran it on the HTML source for this page:
> 
> <http://mail.runrev.com/pipermail/use-revolution/2008-August/113074.html>
> 
> It catches just about everything except for the mailto near the top:
> 
>   <A 
> HREF="mailto:use-revolution%40lists.runrev.com?Subject=Getting%20the%20text%20
> content%20of%20a%20HTML%20page&In-Reply-To=f99b52860808031334l44f6cd1by6ed2444
> fb32560ac%40mail.gmail.com"
>         TITLE="Getting the text content of a HTML page">
> 
> Presumably this is because that tag is broken onto two lines.
> 

Try this variation for the regEx:

   put "(?Us)<.*>" into regEx

The 's' option lets '.' match end-of-line characters as well, so a tag
that is broken across two lines is still matched as a single tag.
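
For reference, a minimal sketch of your function with that one change.
The handler name RegexMethodMultiLine is made up for illustration, and
the comments reflect standard PCRE option behavior:

function RegexMethodMultiLine pHtml
   -- (?U) makes .* stop at the first ">" rather than the last one
   -- (?s) lets . match line endings, so tags split across lines are caught
   put "(?Us)<.*>" into tRegEx
   return replaceText(pHtml, tRegEx, "")
end RegexMethodMultiLine

A tag split over two lines, such as "<A" & cr & "HREF=...>", should then
be removed in one piece.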

> This function takes care of that, and this far benchmarks about an order
> of magnitude faster:
> 
> 
> function HtmlTextMethod pHtml
>    put the properties of the templateField into tSaveProps
>    set the htmlText of the templateField to pHtml
>    get the text of the templateField
>    set the properties of the templateField to tSaveProps
>    return it
> end HtmlTextMethod

One caution with this technique: the documentation notes that Rev's
htmlText handles only a subset of HTML tags, and in today's world of XML,
programmers create tags of their own, so some may remain in the result.
Perhaps add a little catch line or two:

if it contains "<" then
   answer "There may be an extra tag or two remaining in the text"
   answer "Please inspect the result to be sure"
end if
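
Folded into your function, that check might look like the sketch below.
The name HtmlTextMethodChecked and the variable tText are assumptions for
illustration; the rest follows your original handler:

function HtmlTextMethodChecked pHtml
   put the properties of the templateField into tSaveProps
   set the htmlText of the templateField to pHtml
   put the text of the templateField into tText
   set the properties of the templateField to tSaveProps
   -- a leftover "<" suggests a tag the htmlText conversion did not handle
   if tText contains "<" then
      answer "There may be an extra tag or two remaining in the text." & cr & \
            "Please inspect the result to be sure."
   end if
   return tText
end HtmlTextMethodChecked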

Of course, the killer in this exercise is when the text we want has
something like "Tip: solve for x > y, then add the point to your graph"
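
To see the problem, here is a small sketch (the handler name and sample
text are made up for illustration) that runs the regEx method over plain
text containing both a "<" and a ">":

on testAngleBrackets
   put "Tip: if x < y, solve for x > y first" into tText
   -- there is no markup here, but the pattern still sees "<" ... ">"
   put replaceText(tText, "(?Us)<.*>", "") into tStripped
   -- everything from the "<" through the ">" has been removed
   answer tStripped
end testAngleBrackets

The pattern cannot tell a comparison from a tag, so part of the sentence
is lost.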

Fun, games, and work in between.

Jim Ault
Las Vegas




