Regex to remove all tags from a web page

xavier.bury at clearstream.com xavier.bury at clearstream.com
Mon Oct 31 10:06:31 EST 2005


Alex


The trick I used in my discreteBrowser was to treat <script> and
<style> blocks first.

Strip anything between each of those opening tags and its closing tag and
you're a lot safer. If you need that content back later, keep it out of
the tag-stripping operation and put it back afterwards.

Then deal with the html structure...
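
A rough sketch of that order of operations, written in Python rather than Revolution script purely for illustration. The names BLOCK_RE and strip_tags are made up for this example, and it assumes well-formed closing tags; it is not the discreteBrowser code.

```python
import re

# Matches a whole <script>...</script> or <style>...</style> block, contents
# included. DOTALL lets "." span line breaks; IGNORECASE covers <SCRIPT>.
# (Hypothetical helper name, not from the original thread.)
BLOCK_RE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.DOTALL | re.IGNORECASE)


def strip_tags(html):
    """Drop script/style blocks first, then every remaining <...> tag."""
    text = BLOCK_RE.sub("", html)
    pieces = []
    pos = 0
    while True:
        start = text.find("<", pos)
        if start == -1:
            pieces.append(text[pos:])  # no more tags: keep the tail
            break
        end = text.find(">", start + 1)
        if end == -1:
            pieces.append(text[pos:])  # unterminated "<": keep the rest as-is
            break
        pieces.append(text[pos:start])  # keep the text before the tag
        pos = end + 1                   # resume just after the ">"
    return "".join(pieces)


page = "<p>Hi</p><script>if (a < b && c > d) { x(); }</script><b>there</b>"
print(strip_tags(page))
```

Because the script block is gone before the "<" / ">" scan starts, the comparison operators inside the javascript never get mistaken for tag delimiters. To put the blocks back later, one option is to collect the BLOCK_RE matches before removing them.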

cheers
Xavier


use-revolution-bounces at lists.runrev.com wrote on 31/10/2005 15:33:44:

> xavier.bury at clearstream.com wrote:
> 
> >Hi Alex
> >
> >Since I've had crashes with regex on big files, and since it can't
> >handle line breaks, which happen only too often within HTML code,
> >I used offset().
> >
> > 
> >
> I haven't seen any crashes with Regex - but of course that doesn't mean 
> they can't happen :-)
> 
> But the regex I sent certainly does work with line breaks.
> 
> >When I wrote the discreteBrowser, it was the ONLY reliable method to get
> >the HTML tidy or out...
> >
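> >put 0 into a -- number of chars to skip before searching
> >repeat
> >  put offset("<", txt, a) into d -- result is relative to the skip count
> >  if d < 1 then exit repeat
> >  add d to a -- a is now the position of "<"
> >  put offset(">", txt, a) into b
> >  if b < 1 then exit repeat -- unterminated tag: stop
> >  delete char a to (a + b) of txt
> >  subtract 1 from a -- resume at the char that followed the tag
> >end repeat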
> >
> >or something like that. It's quick and infallible... 
> >
> > 
> >
> Actually, it's not infallible. It would fail on Eric's code for the same
> reason my first try did. He has javascript code which can contain "<" or
> ">", so the regex exits half-way through the "comment" that contains the
> javascript source code.
> 
> Easy to fix in your offset scheme (and probably easy to fix in regex as 
> well, but I don't know how to do it yet).
> Time to read some more about regex ...
> 
> 
> -- 
> Alex Tweedly       http://www.tweedly.net
> 
> 
> 


