Regex to remove all tags from a web page

Mon Oct 31 20:04:54 EST 2005

Eric Chatonet wrote:
>
> I searched the list archive and the net for a regex that would allow
> to retrieve the meaningful text from any web page, stripping all html
> tags, extra code, etc. but I did not find something really  convincing
> :-(
> Any help would be much appreciated :-)

I have cast a few 'data mining' scripts with regex, but I tailor each one to
do more than just remove tags.  Are you also trying to format strings (text,
paragraphs) or data (tables, labeled values)?

Specifics are important.  One example: my script for a page of accounting
data had been working great for 3.5 months, and now it has a glitch because
the authors changed the web page format.

tip: Check that </HTML> is in the text, which tells you the download was
complete, whenever it occurred.
tip: Convert all returns to "MMMM" so that there is only one line between
^ and $ (since returns mean nothing in html, why deal with empty lines and
runs of empty lines?)
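
A rough sketch of the two tips above, in Python purely for illustration.
The function names and the sample page are made up for this example, and
"MMMM" is just the placeholder from the tip; any string that cannot occur
in the page would do.

```python
import re

MARKER = "MMMM"  # placeholder from the tip; assumed not to occur in the page

def download_complete(html: str) -> bool:
    """Return True if a closing </html> tag is present (any letter case)."""
    return re.search(r"</html\s*>", html, re.IGNORECASE) is not None

def flatten(html: str) -> str:
    """Replace every line break (CRLF, CR, or LF) with MARKER."""
    return re.sub(r"\r\n|\r|\n", MARKER, html)

# Hypothetical sample page with mixed line endings.
page = "<HTML>\r\n<BODY>\n<P>Total: 42</P>\n</BODY>\n</HTML>\n"
flat = flatten(page)
```

Once the page is flattened, a single regex can match across what used to
be several lines, and the marker can be turned back into returns (or
dropped) at the end.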

One step you should try to incorporate is a 'back check': does the result
have too few or too many characters, does it still contain "<" or ">", are
the key words you expect present and the ones you don't expect absent?
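
One way the back check could look, again as a Python sketch only.  The
function name, its parameters, and the length limits are assumptions for
the example; real limits depend on the page being scraped.

```python
def back_check(text, min_len, max_len, required=(), forbidden=()):
    """Return a list of problems found in the extracted text (empty = OK)."""
    problems = []
    # Too few or too many characters suggests the page layout changed.
    if not (min_len <= len(text) <= max_len):
        problems.append(f"length {len(text)} outside {min_len}..{max_len}")
    # Any leftover angle bracket means a tag (or part of one) survived.
    if "<" in text or ">" in text:
        problems.append("leftover angle bracket")
    lowered = text.lower()
    for word in required:
        if word.lower() not in lowered:
            problems.append(f"missing keyword: {word}")
    for word in forbidden:
        if word.lower() in lowered:
            problems.append(f"unexpected keyword: {word}")
    return problems
```

An empty list means the result passed; anything else is worth logging
before the data goes any further.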

tip: Replacing some tags with a tab char means that you can copy/paste the
block into a spreadsheet and see where the columns are and what excess
needs to be trimmed.
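
A minimal sketch of the tab trick in Python, assuming the data sits in an
ordinary HTML table.  Which tags become tabs is a choice made for this
example (cell ends become tabs, row ends become returns); the function
name and sample table are invented.

```python
import re

def table_to_tsv(html: str) -> str:
    """Turn cell ends into tabs and row ends into newlines, then drop all other tags."""
    text = re.sub(r"</t[dh]\s*>", "\t", html, flags=re.IGNORECASE)
    text = re.sub(r"</tr\s*>", "\n", text, flags=re.IGNORECASE)
    text = re.sub(r"<[^>]+>", "", text)  # strip every remaining tag
    # Each row ends with a tab from its last cell; trim it.
    rows = [line.rstrip("\t") for line in text.splitlines()]
    return "\n".join(row for row in rows if row)

# Hypothetical sample table.
sample = ("<table><tr><td>Item</td><td>Amount</td></tr>"
          "<tr><td>Rent</td><td>1200</td></tr></table>")
tsv = table_to_tsv(sample)
```

The result pastes straight into a spreadsheet.  Note that trimming
trailing tabs also drops empty cells at the end of a row, which may or
may not be what you want.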

Send a page or two my way and I will see if something I have conjured will
work for you.  I'll just toss it in my caldron and see what bubbles to the
top. </Halloween ref>

Jim Ault
Las Vegas