Regex to remove all tags from a web page
Eric Chatonet
eric.chatonet at sosmartsoftware.com
Mon Oct 31 08:51:30 EST 2005
Hi Alex,
Thanks a lot.
That's a good first step, since the output text is about 20 to 30% of the
input text :-)
HTML tags are stripped, but extra code (PHP, Java, etc.) of course
remains.
Any ideas for that?
On 31 Oct 2005, at 13:00, Alex Tweedly wrote:
> Eric Chatonet wrote:
>
>
>> Hi all,
>>
>> I searched the list archive and the net for a regex that would
>> retrieve the meaningful text from any web page,
>> stripping all HTML tags, extra code, etc., but I did not find
>> anything really convincing :-(
>> Any help would be much appreciated :-)
>>
>> PS. I don't want to use "set the htmlText/get text" with a
>> field: that approach crashes Rev unpredictably during batch
>> processing.
>>
>>
> I suspect this will be "not really convincing" :-)
>
> Just removing tags should be
>
>> put "<[^><]*>" into tRex
>> put replacetext(fld "in", tRex, "") into fld "out"
>
> That assumes the HTML has no "<" or ">" outside of tags, and is
> generally well-formed.
> That seems too simple - so it can't be convincing :-)
>
> Alex Tweedly http://www.tweedly.net
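
For the extra code, one direction that might be worth trying is to remove whole blocks (scripts, styles, server-side code, comments) before applying the tag pattern quoted above. What follows is only a rough sketch: the handler name stripMarkup and all patterns except the last one are my own assumptions, not something from Alex's reply, and they rely on replaceText accepting PCRE inline options such as (?s) and lazy quantifiers.

```livecode
-- Hypothetical helper: returns pHtml with code blocks and tags removed.
function stripMarkup pHtml
   -- Drop <script> and <style> elements together with their content
   -- ((?i) ignores case, (?s) lets "." match line breaks).
   put replaceText(pHtml, "(?is)<(script|style)\b.*?</\1\s*>", empty) into pHtml
   -- Drop server-side blocks such as <?php ... ?> and <% ... %>.
   put replaceText(pHtml, "(?s)<\?.*?\?>", empty) into pHtml
   put replaceText(pHtml, "(?s)<%.*?%>", empty) into pHtml
   -- Drop HTML comments.
   put replaceText(pHtml, "(?s)<!--.*?-->", empty) into pHtml
   -- Finally strip the remaining tags with the pattern from the quoted reply.
   put replaceText(pHtml, "<[^><]*>", empty) into pHtml
   return pHtml
end stripMarkup
```

It could be called as: put stripMarkup(fld "in") into fld "out". The order matters, because script content can contain "<" or ">" that would confuse the final pattern. It will still be fooled by "?>" inside a PHP string, and it leaves entities such as &amp;amp; untouched.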
Best Regards from Paris,
Eric Chatonet.