Regex to remove all tags from a web page

Alex Tweedly alex at tweedly.net
Mon Oct 31 09:33:44 EST 2005


xavier.bury at clearstream.com wrote:

>Hi Alex
>
>Since I've had crashes with regex on big files, and since it can't 
>handle line breaks, which
>happen only too often within HTML code, I used offset(). 
>
>  
>
I haven't seen any crashes with Regex - but of course that doesn't mean 
they can't happen :-)

But the regex I sent certainly does work with line breaks.
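
To illustrate the line-break point, here is a minimal sketch. It is not necessarily the pattern from my earlier message, and the handler name stripSimpleTags is made up for this example. A negated character class such as [^>] matches any character other than ">", and that includes line breaks, so a tag split across lines is still matched as one unit.

```livecode
-- Hypothetical helper, for illustration only.
-- Removes anything that looks like <...>, even when the tag spans several lines.
function stripSimpleTags pText
   -- [^>]* means "any run of chars that are not >"; returns and linefeeds qualify
   return replaceText(pText, "<[^>]*>", empty)
end stripSimpleTags
```

For example, stripSimpleTags("<a" & return & "href='x'>link</a>") should return "link". Like any simple scheme, this still trips over a ">" inside a comment or script.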

>When I wrote the discreteBrowser, it was the ONLY reliable method to get 
>the html tidy or
>out... 
>
>put 0 into a
>repeat
>  put offset("<", txt, a) into a
>  if a < 1 then exit repeat
>  put offset(">",txt,a+1) into b
>  delete char a to b of txt
>end repeat
>
>or something like that. It's quick and infallible... 
>
>  
>
Actually, it's not infallible. It would fail on Eric's code for the same 
reason my first try did. He has JavaScript code which can contain "<" or 
">", so the match ends half-way through the "comment" that contains the 
JavaScript source code.

Easy to fix in your offset scheme (and probably easy to fix in regex as 
well, but I don't know how to do it yet).
Time to read some more about regex ...
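
One possible way to fix the offset scheme, as a rough sketch rather than a tested solution: remove the blocks whose contents may contain "<" or ">" (comments and script elements) before removing the ordinary tags. The handler names stripTags and removeBlocks are invented for this example. Note that when offset() is given a number of chars to skip, the value it returns is relative to that skip point, so it has to be added back to get an absolute char position.

```livecode
-- Hypothetical handlers, for illustration only.
function stripTags pText
   -- Drop whole blocks first, since their contents may contain "<" or ">"
   put removeBlocks(pText, "<!--", "-->") into pText
   put removeBlocks(pText, "<script", "</script>") into pText
   -- Then drop the remaining ordinary tags
   return removeBlocks(pText, "<", ">")
end stripTags

function removeBlocks pText, pOpen, pClose
   local tSkip, tRel, tStart, tAfterOpen, tEnd
   put 0 into tSkip
   repeat forever
      -- Find the next opening marker after the chars already dealt with
      put offset(pOpen, pText, tSkip) into tRel
      if tRel = 0 then exit repeat
      put tSkip + tRel into tStart
      -- Look for the closing marker after the end of the opening marker
      put tStart + length(pOpen) - 1 into tAfterOpen
      put offset(pClose, pText, tAfterOpen) into tRel
      -- Unterminated block: leave the rest of the text alone
      if tRel = 0 then exit repeat
      put tAfterOpen + tRel + length(pClose) - 1 into tEnd
      delete char tStart to tEnd of pText
      -- Resume searching from where the deleted block used to start
      put tStart - 1 into tSkip
   end repeat
   return pText
end removeBlocks
```

With this, stripTags("x<script>if (a<b) y</script>z") should give "xz". It still assumes that no attribute value contains ">", and that the text "</script>" never appears inside a script string, so it is sturdier than the simple loop but not infallible either.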


-- 
Alex Tweedly       http://www.tweedly.net






