Getting the text content of an HTML page

Jim Ault JimAultWins at yahoo.com
Tue Aug 5 14:15:22 EDT 2008


On 8/5/08 10:18 AM, "Richard Gaskin" <ambassador at fourthworld.com> wrote:

>  But given its blindingly fast performance and the scope of things it
> handles in well-optimized machine-compiled code in the engine, it seems
> a good starting point for a more complete function which would have
> relatively little other cleanup work to do after using it.


Agreed that the htmlText is the fastest method for removing balanced tags.

It is good to know all the benchmarking results you produce.  I save these
since I want to know, and it is better that the same techniques are used so
the results can be compared.  Thanks for the good info.

What I have needed in my apps is the ability to parse the raw HTML, pick out
certain tags along with the user-visible text, and then extract the data.  In
other words, data mining.

One example is several charts of stock data shown on a page.  The column
headers are text that is repeated many times on the page, so that particular
text is not good for isolating a particular table, but in almost every case,
the html tags do allow that specificity.

After isolating a table by its tags, I use the column header text to make
sure I am extracting the correct data, even if the publisher of the web page
moves the columns or tables.  Now I have the correct values to add to my
database.  Of course, I do error checking on the data values before assuming
the page is accurate.
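
The two-step approach described above (isolate the table by its tags, then
locate columns by their header text) might look roughly like the sketch
below.  It is written in Python purely for illustration, not LiveCode, and it
assumes the target table carries an id attribute.  The names TableGrabber and
extract_columns, the ids, and the sample page are all invented for the
example.

```python
from html.parser import HTMLParser


class TableGrabber(HTMLParser):
    """Collect the cell text of the one <table> whose id matches table_id."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.depth = 0      # > 0 while inside the target table (nested tables add 1)
        self.rows = []      # finished rows, each a list of cell strings
        self.row = None     # row currently being built
        self.cell = None    # text fragments of the cell currently being built

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.depth:
                self.depth += 1                      # nested table: skip its cells
            elif dict(attrs).get("id") == self.table_id:
                self.depth = 1                       # found the table we want
        elif self.depth == 1:
            if tag == "tr":
                self.row = []
            elif tag in ("td", "th") and self.row is not None:
                self.cell = []

    def handle_endtag(self, tag):
        if not self.depth:
            return
        if tag == "table":
            self.depth -= 1
        elif self.depth == 1:
            if tag in ("td", "th") and self.cell is not None:
                self.row.append("".join(self.cell).strip())
                self.cell = None
            elif tag == "tr" and self.row is not None:
                self.rows.append(self.row)
                self.row = None

    def handle_data(self, data):
        if self.depth == 1 and self.cell is not None:
            self.cell.append(data)


def extract_columns(html, table_id, wanted):
    """Return one dict per data row, keyed by the header names in `wanted`."""
    grabber = TableGrabber(table_id)
    grabber.feed(html)
    grabber.close()
    if not grabber.rows:
        raise ValueError("table %r not found or empty" % table_id)
    header, *body = grabber.rows
    # Look columns up by header text, so a reordered table still works.
    # header.index raises ValueError if an expected header has disappeared.
    index = {name: header.index(name) for name in wanted}
    return [{name: row[i] for name, i in index.items()}
            for row in body if len(row) == len(header)]


# Invented sample: the same headers repeat, but in a different column order.
SAMPLE = """
<table id="first"><tr><th>Symbol</th><th>Last</th><th>Volume</th></tr>
<tr><td>AAA</td><td>10.00</td><td>100</td></tr></table>
<table id="second"><tr><th>Volume</th><th>Symbol</th><th>Last</th></tr>
<tr><td>200</td><td>BBB</td><td>12.50</td></tr>
<tr><td>300</td><td>CCC</td><td>7.25</td></tr></table>
"""

result = extract_columns(SAMPLE, "second", ["Symbol", "Last"])
```

Because the header text alone cannot tell the two tables apart, the id does
the isolating.  The header lookup then guards against moved columns, and a
missing header raises an error rather than returning the wrong data.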

Another case where the tags are needed is testing whether the web server has
sent back a special condition, such as "interrupted", "not available", or
"maintenance".

Another case is looking for the absence of tags that mean missing data or
incomplete server delivery.
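
Both of these checks can be sketched as a small pre-flight test run before
any data is extracted.  Again this is Python for illustration only.  The
phrase list, the list of required tags, and the function name page_problems
are assumptions; a real page would need its own lists.

```python
import re

# Hypothetical status phrases that signal a special condition.
TROUBLE_PHRASES = ("interrupted", "not available", "maintenance")
# Hypothetical tags a complete page is expected to contain.
REQUIRED_TAGS = ("table", "/table", "/body", "/html")


def page_problems(html):
    """Return a list of problems found; an empty list means the page looks usable."""
    problems = []
    lowered = html.lower()
    # Collect tag names, keeping the leading "/" on closing tags.
    tags_present = set(re.findall(r"<\s*(/?[a-z][a-z0-9]*)", lowered))
    for tag in REQUIRED_TAGS:
        if tag not in tags_present:
            problems.append("missing <%s>" % tag)
    # Search visible text only, so attribute values do not trigger a match.
    visible = re.sub(r"<[^>]*>", " ", lowered)
    for phrase in TROUBLE_PHRASES:
        if phrase in visible:
            problems.append("server notice: %s" % phrase)
    return problems


complete = "<html><body><table><tr><td>1</td></tr></table></body></html>"
truncated = "<html><body><table><tr><td>1</td></tr>"
notice = "<html><body><p>Site down for maintenance</p></body></html>"
```

A missing closing tag is a cheap sign that delivery was cut short.  Once the
tags have been stripped, neither that signal nor the absence of the expected
table can be recovered from the text alone.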

Jim Ault
Las Vegas





More information about the use-livecode mailing list