remove html tags from text

Jim Ault JimAultWins at yahoo.com
Sat Sep 9 01:09:16 EDT 2006


On 9/8/06 7:57 PM, "Richard Gaskin" <ambassador at fourthworld.com> wrote:

> Jim Ault wrote:
> 
>> Cubist  is correct.  Any well-formed page will have balanced tags and only
>> use the < and > chars to mean tag markers.
> 
> But can one deliver a product which assumes all the html thrown at it
> will be well-formed?
> 
> So I have two questions about the sort of variable-based methods for
> filtering SGML-style tags and using a field object to do the same:
> 
> 1. Which is more forgiving of html which may not be well-formed?
> 
> 2. Which is faster?

My quick comment is to consider the sources of data and the intended use.
If gathering content for human review, it is possible to build many tools that
could refine 'raw content' and even index it using a controlled vocabulary.

In my projects, the data needs to be mined and honed without human
intervention, so the 'smart' functions need to be applied judiciously.
Further, since my text blocks are small, I can afford to do more elaborate
steps that involve RegEx and error checking.  Additionally, if any data is
suspicious, it can be discarded with little penalty.
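
As a rough illustration of the 'RegEx plus error checking, discard if
suspicious' approach for small text blocks, here is a minimal sketch in Python
(not LiveCode).  The function name strip_tags_or_discard and the rule that any
leftover < or > makes a block suspicious are assumptions made for this
example, not a description of my actual scripts.

```python
import re
from typing import Optional

# Matches one balanced tag: '<', anything except '<' or '>', then '>'.
TAG_RE = re.compile(r"<[^<>]*>")


def strip_tags_or_discard(block: str) -> Optional[str]:
    """Hypothetical helper: strip tags, or return None if the block looks suspicious.

    Assumption for this sketch: in a well-formed block, '<' and '>' appear
    only as tag markers, so any that survive the regex signal bad markup.
    """
    text = TAG_RE.sub("", block)
    if "<" in text or ">" in text:
        # Error check failed: an unbalanced or stray marker is left over.
        return None
    # Collapse runs of whitespace left behind by the removed tags.
    return " ".join(text.split())


if __name__ == "__main__":
    print(strip_tags_or_discard("<p>Hello <b>world</b></p>"))
    print(strip_tags_or_discard("<p>Price < 5</p>"))
```

The second call returns None because a bare < is left after the balanced tags
are removed.  This only makes sense when throwing a block away is cheap.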

> 1. Which is more forgiving of html which may not be well-formed?
I would favor a decision tree that applied specific rules, found exceptions
and tried to react to them, thus variable-based methods with elaborate
parsing rules.  So many html sources are generated by database engines these
days, that errors which make no difference to the viewer will be propagated
throughout a site.  This means that html tags which look 'bad' to a parser like
yours make no difference to a site manager, thus there is no reason to fix them.
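
To make 'specific rules plus reacting to exceptions' a little more concrete,
here is a minimal sketch in Python (not LiveCode) of a variable-based scanner
that tolerates a few kinds of malformed markup.  The function name
strip_tags_forgiving and the four rules are assumptions chosen for this
example; a real rule set would be tuned to the sites being mined.

```python
def strip_tags_forgiving(html: str) -> str:
    """Hypothetical rule-based tag stripper that tolerates some bad html."""
    out = []
    i, n = 0, len(html)
    while i < n:
        ch = html[i]
        if ch != "<":
            # Ordinary character (including a stray '>'): keep it.
            out.append(ch)
            i += 1
            continue
        nxt = html[i + 1] if i + 1 < n else ""
        # Rule 1: '<' not followed by a letter, '/' or '!' is literal text,
        # e.g. the '<' in "a < b".
        if not (nxt.isalpha() or nxt in ("/", "!")):
            out.append(ch)
            i += 1
            continue
        close = html.find(">", i + 1)
        reopen = html.find("<", i + 1)
        # Rule 2: a tag that is never closed before the end of the input
        # is dropped along with everything after it.
        if close == -1:
            break
        # Rule 3: a new '<' before the closing '>' means the current tag was
        # never closed; drop the fragment and resume at the new '<'.
        if reopen != -1 and reopen < close:
            i = reopen
            continue
        # Rule 4: normal case, skip the whole tag.
        i = close + 1
    return "".join(out)


if __name__ == "__main__":
    print(strip_tags_forgiving("<b>bold<i unclosed <u>text</u>"))
    print(strip_tags_forgiving("a < b and c > d"))
```

Each rule is a guess about what the author of the page meant, so the output
still deserves a sanity check before it is trusted.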

A parallel is the use of OCR (optical character recognition) software.  How
fast do you want to go to get to 85% correct... 95% correct... then have a
reviewer do the final editing?

If you are repeatedly mining the same sites (e.g. news agencies, competitors)
then it is easier.  Random authors/sites become more difficult.

Hope this gives you a bit of my opinion, but everyone's mileage can and will
vary.

Jim Ault
Las Vegas





