Getting the text content of an HTML page

Richard Gaskin ambassador at fourthworld.com
Tue Aug 5 13:18:22 EDT 2008


Jim Ault wrote:

 > Richard wrote:
 >> This function takes care of that, and thus far benchmarks about
 >> an order of magnitude faster:
 >>
 >> function HtmlTextMethod pHtml
 >>    put the properties of the templateField into tSaveProps
 >>    set the htmlText of the templateField to pHtml
 >>    get the text of the templateField
 >>    set the properties of the templateField to tSaveProps
 >>    return it
 >> end HtmlTextMethod
 >
 > A caution with this technique: the documentation notes that the
 > Rev tags include only a subset of HTML tags.

So far I've had only good results with the htmlText method noted above.
Not only is it blazing fast, but apparently it accounts for all <>
tags, not just the ones the engine generates.  For example, <head> and
other non-Rev-generated tags are stripped along with <b> and the rest.

Also, note this difference between the RegEx method and the htmlText
method, using a snippet from a list post:

RegEx result:

    put the 
replaceText(myText,"</?[A-Za-z]+>","") into myText


htmlText result:

put the replaceText(myText,"</?[A-Za-z]+>","") into myText

The RegEx version also added a lot more white space to the output, while 
the htmlText version preserved the original formatting appearance with 
greater fidelity.

Of course the usefulness of this depends on what you want to do with the 
output.  If the goal is to strip tags only but leave HTML entities in 
place, htmlText is not the answer.  But if the goal is to strip HTML to 
a form most suitable for display in a field as plain text, the htmlText 
method does most of the work for you in just two very efficient lines.
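
To make the entity difference concrete, here is a minimal sketch.  The
mouseUp handler and the sample string are made up for illustration; the
function is the one quoted above, repeated so the snippet stands on its
own.  With this input the RegEx line leaves "Fish &amp; chips
&lt;tonight&gt;" behind, entities intact, while the htmlText version
should hand the entities back as plain characters.

```livecode
-- The function from the quoted message, repeated for completeness
function HtmlTextMethod pHtml
   put the properties of the templateField into tSaveProps
   set the htmlText of the templateField to pHtml
   get the text of the templateField
   set the properties of the templateField to tSaveProps
   return it
end HtmlTextMethod

-- Hypothetical test handler; the sample HTML is invented for this sketch
on mouseUp
   put "<p>Fish &amp; chips &lt;tonight&gt;</p>" into tHtml
   -- RegEx method: removes simple tags, leaves entities untouched
   put replaceText(tHtml, "</?[A-Za-z]+>", "") into tRegexResult
   -- htmlText method: the engine parses the markup, entities included
   put HtmlTextMethod(tHtml) into tFieldResult
   -- Show both results in the message box for comparison
   put "RegEx:" && tRegexResult & return & "htmlText:" && tFieldResult
end mouseUp
```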

That said, I have no illusions that the htmlText function above will
work for _everything_ that might wind up in a web page or XML document.
But given its blindingly fast performance and the scope of things it
handles in well-optimized machine-compiled code in the engine, it seems
a good starting point for a more complete function which would have
relatively little other cleanup work to do after using it.
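
As a rough idea of what such a more complete function might look like,
here is a sketch.  The name HtmlToPlainText and every cleanup step in
it are my own assumptions rather than anything tested against real
pages: it drops <script> and <style> blocks up front in case their
contents would otherwise come through as text (I have not checked
whether they do), then tidies the white space afterward.

```livecode
-- Hypothetical wrapper; the name and cleanup steps are illustrative assumptions
function HtmlToPlainText pHtml
   -- Assumed pre-pass: remove script and style blocks, contents included
   -- (?is) makes the match case-insensitive and lets . span line breaks
   put replaceText(pHtml, "(?is)<script.*?</script>", "") into pHtml
   put replaceText(pHtml, "(?is)<style.*?</style>", "") into pHtml

   -- Core step, same as the htmlText method: let the engine strip the tags
   put the properties of the templateField into tSaveProps
   set the htmlText of the templateField to pHtml
   put the text of the templateField into tText
   set the properties of the templateField to tSaveProps

   -- Assumed post-pass: collapse runs of three or more line breaks to two
   put replaceText(tText, "\n{3,}", return & return) into tText
   -- Trim leading and trailing white space
   return word 1 to -1 of tText
end HtmlToPlainText
```

Keeping the heavy lifting in the template field means the script-side
work stays limited to a few replaceText calls around it.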

-- 
  Richard Gaskin
  Managing Editor, revJournal
  _______________________________________________________
  Rev tips, tutorials and more: http://www.revJournal.com



