Getting the text content of a HTML page

Richard Gaskin ambassador at fourthworld.com
Mon Aug 4 14:25:26 EDT 2008


Jim Ault wrote:

> The problem with this may be that it only looks for alpha chars,
> not spaces or numbers, quotes or equal signs
> therefore it finds fewer matches depending on the html
> 
> oops, these don't match and won't be replaced with empty  --------------
> <img src="somebody.jpg" width="160">
> <img src="somebody.jpg" width="160" />
> <div class="mainFormat">
> <table cellpadding="" width=100%">
> <b />
> <hr />
> 
> works on this tag  -------------
> <B>Making this bold</B>
> 
> put "" into newString
> put "(?U)<.*>" into regEx
> put replaceText(myText,regEx,newString) into myText

I put that into this function:

function RegexMethod pHtml
   put "" into newString
   put "(?U)<.*>" into regEx
   return replaceText(pHtml,regEx,newString)
end RegexMethod

...and then ran it on the HTML source for this page:

<http://mail.runrev.com/pipermail/use-revolution/2008-August/113074.html>

It catches just about everything except for the mailto near the top:

  <A 
HREF="mailto:use-revolution%40lists.runrev.com?Subject=Getting%20the%20text%20content%20of%20a%20HTML%20page&In-Reply-To=f99b52860808031334l44f6cd1by6ed2444fb32560ac%40mail.gmail.com"
        TITLE="Getting the text content of a HTML page">

Presumably this is because that tag is broken onto two lines, and by 
default "." in a regex does not match a line break.
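
If the line break is indeed the culprit, one possible fix is to match 
"anything but >" instead of ".", since a negated character class also 
matches line breaks. A minimal sketch (RegexMethod2 is a hypothetical 
name, and this has not been run against that page):

function RegexMethod2 pHtml
   -- [^>]* matches any run of chars other than ">", line breaks
   -- included, so a tag split across lines is still caught.
   -- It stops at the first ">" by itself, so (?U) is not needed.
   return replaceText(pHtml,"<[^>]*>",empty)
end RegexMethod2

Like the original pattern, this would still be fooled by a literal ">" 
inside an attribute value. Regex is not the only option, though; the 
approach below avoids it.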

This function, which uses htmlText instead of a regex, takes care of 
that, and thus far benchmarks about an order of magnitude faster:


function HtmlTextMethod pHtml
   put the properties of the templateField into tSaveProps
   set the htmlText of the templateField to pHtml
   get the text of the templateField
   set the properties of the templateField to tSaveProps
   return it
end HtmlTextMethod
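
For a rough comparison of the two functions, a timing handler along 
these lines could be used. This is only a sketch: the handler, the 
variable names and the loop count of 100 are assumptions, not the 
benchmark actually run, and it expects RegexMethod and HtmlTextMethod 
to be in the same script.

on mouseUp
   -- fetch the page source once so download time is not measured
   put url "http://mail.runrev.com/pipermail/use-revolution/2008-August/113074.html" into tHtml
   put 100 into tRuns -- assumed loop count
   -- time the regex version
   put the milliseconds into tStart
   repeat for tRuns times
      get RegexMethod(tHtml)
   end repeat
   put the milliseconds - tStart into tRegexMs
   -- time the htmlText version
   put the milliseconds into tStart
   repeat for tRuns times
      get HtmlTextMethod(tHtml)
   end repeat
   put the milliseconds - tStart into tHtmlTextMs
   -- show both totals in the message box
   put "Regex:" && tRegexMs && "ms" & cr & "htmlText:" && tHtmlTextMs && "ms"
end mouseUp

Running each function many times smooths out the coarse resolution of 
the milliseconds; the absolute numbers will vary by machine and page.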


-- 
  Richard Gaskin
  Managing Editor, revJournal
  _______________________________________________________
  Rev tips, tutorials and more: http://www.revJournal.com


