Getting the text content of a HTML page
ambassador at fourthworld.com
Tue Aug 5 13:18:22 EDT 2008
Jim Ault wrote:
> Richard wrote:
>> This function takes care of that, and so far benchmarks about
>> an order of magnitude faster:
>> function HtmlTextMethod pHtml
>>   put the properties of the templateField into tSaveProps
>>   set the htmlText of the templateField to pHtml
>>   get the text of the templateField
>>   set the properties of the templateField to tSaveProps
>>   return it
>> end HtmlTextMethod
> Caution with this technique: the documentation notes that the Rev
> htmlText tags include only a subset of HTML tags.
So far I've had only good results with the htmlText method noted above.
Not only is it blazing fast, but apparently it accounts for all <>
tags, not just the ones the engine generates. For example, <head> and
other non-Rev-generated tags are stripped along with <b> and the rest.
Also, note this difference between the RegEx method and the htmlText
method, using a snippet from a list post:
replaceText(myText,"</?[A-Za-z]+>","") into myText
put the replaceText(myText,"</?[A-Za-z]+>","") into myText
The RegEx version also added a lot more white space to the output, while
the htmlText version preserved the original formatting appearance with
Of course the usefulness of this depends on what you want to do with the
output. If the goal is to strip tags only but leave HTML entities in
place, htmlText is not the answer. But if the goal is to strip HTML to
a form most suitable for display in a field as plain text, the htmlText
method does most of the work for you in just two very efficient lines.
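
For the other case, where the tags should go but the entities should
stay put, a minimal sketch might look like the following. The handler
name StripTagsOnly is made up for illustration, and the pattern is a
broader variant of the one in the snippet above, which only matches
tags made of letters alone (so it would skip <h1> or any tag with
attributes). Treat it as a starting point, not a tested solution:

```livecode
-- Hypothetical helper: removes anything that looks like a tag,
-- but leaves HTML entities such as &amp; and &lt; untouched.
function StripTagsOnly pHtml
   -- "<[^>]+>" matches "<", then one or more characters that are
   -- not ">", then the closing ">". This covers closing tags and
   -- tags with attributes, but it is naive: a ">" inside an
   -- attribute value or a comment will end the match early.
   return replaceText(pHtml, "<[^>]+>", empty)
end StripTagsOnly
```

Unlike the htmlText method, this leaves the contents of elements such
as <script> and <style> in the output, so those would need their own
handling.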
That said, I have no illusions that the htmlText function above will
work for _everything_ that might wind up in a web page or XML document.
But given its blindingly fast performance and the scope of things it
handles in well-optimized machine-compiled code in the engine, it seems
a good starting point for a more complete function which would have
relatively little other cleanup work to do after using it.
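
As one possible shape for such a follow-on function, here is a rough
sketch. The name HtmlToPlainText and the blank-line cleanup step are
assumptions for illustration only; the core is the same templateField
technique quoted above, and none of this has been tested against
real-world pages:

```livecode
-- Hypothetical wrapper: htmlText conversion plus one cleanup pass.
function HtmlToPlainText pHtml
   local tSaveProps, tText
   -- Save the templateField's state so it can be restored afterwards
   put the properties of the templateField into tSaveProps
   -- Let the engine parse the HTML and hand back plain text
   set the htmlText of the templateField to pHtml
   put the text of the templateField into tText
   set the properties of the templateField to tSaveProps
   -- Assumed cleanup: collapse runs of three or more line breaks
   -- into a single blank line
   put replaceText(tText, "\n{3,}", cr & cr) into tText
   return tText
end HtmlToPlainText
```

Any further cleanup a given page needs could be added after the
replaceText line without touching the conversion itself.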
Managing Editor, revJournal
Rev tips, tutorials and more: http://www.revJournal.com