Regex to remove all tags from a web page

Alex Tweedly alex at tweedly.net
Mon Oct 31 07:00:20 EST 2005


Eric Chatonet wrote:

> Hi all,
>
> I searched the list archive and the net for a regex that would let me
> retrieve the meaningful text from any web page, stripping all HTML
> tags, extra code, etc., but I did not find anything really convincing :-(
> Any help would be much appreciated :-)
>
> PS. I don't want to use "set the htmlText/get text" using a field:
> this way crashes Rev unpredictably when doing batch processing.
>
I suspect this will be "not really convincing" :-)

Just removing the tags should be as simple as:

    put "<[^><]*>" into tRex
    put replaceText(fld "in", tRex, "") into fld "out"

That assumes the HTML has no stray "<" or ">" characters (for example inside an attribute value or in the text itself) and is generally well-formed.
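
For batch processing it may be handier to wrap this in a function that works on a string rather than on fields. The following is only a sketch: the names stripTags and pHtml are made up for illustration, and it inherits the same assumptions about the input.

```livecode
-- Hypothetical helper: returns pHtml with every "<...>" run removed.
-- The pattern matches a "<", then any characters that are neither
-- ">" nor "<", then the closing ">".
function stripTags pHtml
   return replaceText(pHtml, "<[^><]*>", "")
end stripTags

-- Example: stripTags("<p>Hello <b>world</b></p>") returns "Hello world"
```

To see where the assumption bites: for input such as <a title="x > y">link</a>, the match stops at the first ">", so the fragment  y">link is left behind. Note also that only the tags themselves are removed. Whatever sits between <script> or <style> tags is ordinary text as far as this pattern is concerned, so it stays in the output.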

That seems too simple - so it can't be convincing :-)


-- 
Alex Tweedly       http://www.tweedly.net






