Regex to remove all tags from a web page
Alex Tweedly
alex at tweedly.net
Mon Oct 31 09:33:44 EST 2005
xavier.bury at clearstream.com wrote:
>Hi Alex
>
>Since I've had crashes with regex with big files, and since it can't
>handle line breaks, which happen only too often within html code, I
>used offset().
>
I haven't seen any crashes with regex, but of course that doesn't mean
they can't happen :-)
But the regex I sent certainly does work with line breaks.
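
To illustrate the line-break point, here is a minimal sketch, written in Python rather than Transcript so it can be run standalone. The pattern is an assumption for illustration and not necessarily the one from my earlier message, and the name strip_tags_regex is hypothetical. The idea is that a negated class such as [^>] also matches newline characters, so a tag that is split across lines is still matched as a whole.

```python
import re

# Assumed pattern: "<", then any run of characters that are not ">"
# (newlines included), then the closing ">".
TAG = re.compile(r"<[^>]*>")


def strip_tags_regex(html: str) -> str:
    """Remove every <...> span, including tags broken across lines."""
    return TAG.sub("", html)


# A tag with a line break between the tag name and its attribute.
sample = '<a\n href="x.html">link</a> text'
print(strip_tags_regex(sample))
```

This only shows that line breaks inside a tag are not a problem; it does nothing special for script source or comments.
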
>When I wrote the discreteBrowser, it was the ONLY reliable method to get
>the html tidy or out...
>
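>repeat
> -- always search from the start: earlier tags have already been deleted
> put offset("<", txt) into a
> if a < 1 then exit repeat
> -- with charsToSkip, offset() returns a position relative to the skipped chars
> put offset(">", txt, a) into b
> if b < 1 then exit repeat
> delete char a to (a + b) of txt
>end repeat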
>
>or something like that. It's quick and infallible...
>
Actually, it's not infallible. It would fail on Eric's code for the same
reason my first try did. His page has JavaScript code which can contain "<"
or ">", so the match ends half-way through the "comment" that contains the
JavaScript source code.
Easy to fix in your offset scheme (and probably easy to fix in regex as
well, but I don't know how to do it yet).
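
One possible shape for the fix to the offset scheme, as a rough sketch only. It is in Python rather than Transcript so it can be run standalone, with str.find standing in for offset(). The function name strip_tags and the list of block markers are my own assumptions, not anything from Eric's page or the discreteBrowser. The idea: when a "<" starts a comment or a script element, skip to that block's own closing marker instead of to the next ">".

```python
import re


def strip_tags(html: str) -> str:
    """Offset-style tag stripper that skips whole comments and scripts."""
    # (opening marker, closing marker) for regions whose contents may
    # legitimately contain "<" or ">". Assumed list, checked in order.
    blocks = (("<!--", "-->"), ("<script", "</script>"))
    out = []
    pos = 0
    while True:
        start = html.find("<", pos)
        if start < 0:
            out.append(html[pos:])      # no more tags: keep the tail
            break
        out.append(html[pos:start])     # keep the text before the tag
        closer = ">"                    # default: an ordinary tag
        for opener, block_closer in blocks:
            # Case-insensitive check of the characters at `start`.
            if html[start:start + len(opener)].lower() == opener:
                closer = block_closer
                break
        # Case-insensitive search for the closing marker after the "<".
        match = re.compile(re.escape(closer), re.IGNORECASE).search(html, start + 1)
        if match is None:
            break                       # unterminated tag or block: drop the rest
        pos = match.end()
    return "".join(out)


page = ('<p>Hi</p><script><!--\n'
        'if (a < b && c > d) { go(); }\n'
        '//--></script><b>there</b>')
print(strip_tags(page))
```

A plain "find the next >" loop would stop at the ">" inside the script and leave part of the source in the output. The sketch treats anything starting with "<script" as a script block, and it does not try to handle a ">" inside a quoted attribute value.
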
Time to read some more about regex ...
--
Alex Tweedly http://www.tweedly.net