Regex to remove all tags from a web page

xavier.bury at clearstream.com xavier.bury at clearstream.com
Mon Oct 31 09:04:25 EST 2005


Hi Alex

Since I've had crashes using regex on big files, and since it can't 
handle the line breaks that occur all too often within HTML code, 
I used offset() instead. 

When I wrote the discreteBrowser, it was the ONLY reliable method to 
tidy up the HTML or strip it out... 

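put 0 into s
repeat
  -- offset() counts from the skipped chars, so add them back
  put offset("<", txt, s) into a
  if a < 1 then exit repeat
  add s to a
  put offset(">", txt, a) into b
  if b < 1 then exit repeat  -- unmatched "<": leave the rest alone
  delete char a to a + b of txt
  put a - 1 into s  -- resume the search where the tag was
end repeat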

or something like that. It's quick and infallible... 
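
For reuse in batch processing, the same offset() scan could be wrapped in a function. The sketch below is only one possible variant: it copies the text between tags into a new string instead of deleting in place, which may behave better on large pages. The names stripTags, pText, tResult, tSkip, tOpen and tClose are made up for illustration. It relies on offset() returning a position relative to the skipped characters, and it treats every "<" ... ">" pair as a tag, so script and style contents are left in the result.

```livecode
function stripTags pText
  local tResult, tSkip, tOpen, tClose
  put 0 into tSkip
  repeat forever
    -- position of the next "<", counted from the skipped chars
    put offset("<", pText, tSkip) into tOpen
    if tOpen = 0 then exit repeat
    -- keep the text that sits before this tag
    if tOpen > 1 then put char tSkip + 1 to tSkip + tOpen - 1 of pText after tResult
    add tOpen to tSkip  -- tSkip is now the position of "<"
    put offset(">", pText, tSkip) into tClose
    if tClose = 0 then
      -- unmatched "<": keep it and everything after it
      subtract 1 from tSkip
      exit repeat
    end if
    add tClose to tSkip  -- tSkip is now the position of ">"
  end repeat
  -- append whatever follows the last tag
  if tSkip < the number of chars of pText then put char tSkip + 1 to -1 of pText after tResult
  return tResult
end stripTags
```

It could then be called with something like: put stripTags(txt) into txt. Because the scan looks only at characters, a tag that is broken across several lines is handled the same way as one on a single line.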

cheers
Xavier
http://monsieurx.com/taoo


use-revolution-bounces at lists.runrev.com wrote on 31/10/2005 12:19:43:

> Hi all,
> 
> I searched the list archive and the net for a regex that would allow 
> to retrieve the meaningful text from any web page, stripping all html 
> tags, extra code, etc. but I did not find something really 
> convincing :-(
> Any help would be much appreciated :-)
> 
> PS. I don't want to use "set the htmlText/get text" using a field: 
> this way crashes Rev unpredictably when doing batch processing.
> 
> Best Regards from Paris,
> 
> Eric Chatonet.


