Regex to remove all tags from a web page
xavier.bury at clearstream.com
Mon Oct 31 09:04:25 EST 2005
Hi Alex
Since I've had crashes with regex on big files, and since it can't
handle the line breaks that
happen only too often within HTML code, I used offset() instead.
When I wrote the discreteBrowser, it was the ONLY reliable method to
tidy the HTML or strip it
out...
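put 0 into a
repeat
  -- offset() counts from the end of the skipped chars, so add the skip back
  put offset("<", txt, a) into b
  if b < 1 then exit repeat
  add b to a -- a is now the absolute position of "<"
  put offset(">", txt, a) into b
  if b < 1 then exit repeat -- unclosed tag: stop
  delete char a to (a + b) of txt
  subtract 1 from a -- resume the search where the tag was
end repeat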
or something like that. It's quick and infallible...
cheers
Xavier
http://monsieurx.com/taoo
use-revolution-bounces at lists.runrev.com wrote on 31/10/2005 12:19:43:
> Hi all,
>
> I searched the list archive and the net for a regex that would allow
> to retrieve the meaningful text from any web page, stripping all html
> tags, extra code, etc. but I did not find something really
> convincing :-(
> Any help would be much appreciated :-)
>
> PS. I don't want to use "set the htmlText/get text" using a field:
> this way crashes Rev unpredictably when doing batch processing.
>
> Best Regards from Paris,
>
> Eric Chatonet.
> ----------------------------------------------------------------
> So Smart Software
>
> For institutions, companies and associations
> Built-to-order applications: management, multimedia, internet, etc.
> Windows, Mac OS and Linux... With the French touch
>
> Free plugins and tutorials on my website
> ----------------------------------------------------------------
> Web site http://www.sosmartsoftware.com/
> Email eric.chatonet at sosmartsoftware.com/
> Phone 33 (0)1 43 31 77 62
> Mobile 33 (0)6 20 74 50 86
> ----------------------------------------------------------------