Getting the text content of a HTML page
Richard Gaskin
ambassador at fourthworld.com
Mon Aug 4 14:25:26 EDT 2008
Jim Ault wrote:
> The problem with this may be that it only looks for alpha chars,
> not spaces or numbers, quotes or equal signs
> therefore it finds less matches depending on the html
>
> oops, these don't match and won't be replaced with empty --------------
> <img src="somebody.jpg" width="160">
> <img src="somebody.jpg" width="160" />
> <div class="mainFormat">
> <table cellpadding="" width=100%">
> <b />
> <hr />
>
> works on this tag -------------
> <B>Making this bold</B>
>
> put "" into newString
> put "(?U)<.*> into regEx
> put replaceText(myText,regEx,newString) into myText
I put that into this function:
function RegexMethod pHtml
   put "" into newString
   put "(?U)<.*>" into regEx
   return replaceText(pHtml,regEx,newString)
end RegexMethod
...and then ran it on the HTML source for this page:
<http://mail.runrev.com/pipermail/use-revolution/2008-August/113074.html>
It catches just about everything except for the mailto near the top:
<A
HREF="mailto:use-revolution%40lists.runrev.com?Subject=Getting%20the%20text%20content%20of%20a%20HTML%20page&In-Reply-To=f99b52860808031334l44f6cd1by6ed2444fb32560ac%40mail.gmail.com"
TITLE="Getting the text content of a HTML page">
Presumably this is because that tag is broken across more than one line, and "." in a regular expression does not match a line break by default.
This function takes care of that, and thus far benchmarks about an order
of magnitude faster:
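function HtmlTextMethod pHtml
   -- save the templateField's properties so they can be restored afterwards
   put the properties of the templateField into tSaveProps
   -- let the engine parse the HTML, then read back the plain text
   set the htmlText of the templateField to pHtml
   get the text of the templateField
   set the properties of the templateField to tSaveProps
   return it
end HtmlTextMethod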
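For anyone who wants to stay with the regex approach, here is a minimal sketch of a variant that should also catch tags split over several lines. It swaps "." for a negated character class, which matches line breaks as well. The names RegexMethodMultiline and tHtml are my own, purely for illustration, and I have not benchmarked this against the two functions above.

```livecode
-- Hypothetical variant of RegexMethod (name is illustrative only).
-- [^>] matches any character except ">", line breaks included,
-- so a tag that wraps onto the next line is still removed.
function RegexMethodMultiline pHtml
   put "<[^>]*>" into regEx
   return replaceText(pHtml, regEx, empty)
end RegexMethodMultiline

-- Quick check with a tag deliberately broken across two lines.
on mouseUp
   put "<A" & cr & "HREF=" & quote & "x" & quote & ">link</A>" into tHtml
   put RegexMethodMultiline(tHtml)  -- shows: link
end mouseUp
```

One caveat: a literal ">" inside an attribute value would end the match early, so this is only a rough filter. Adding the "s" option to the original pattern, as in "(?sU)<.*>", should have a similar effect, since the engine's regex support is PCRE-based, but I have not tried it here.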
--
Richard Gaskin
Managing Editor, revJournal
_______________________________________________________
Rev tips, tutorials and more: http://www.revJournal.com