Getting the text content of a HTML page

Eric Chatonet eric.chatonet at sosmartsoftware.com
Sat Aug 2 12:44:45 EDT 2008


Hi Heather,

So simple ;-)
I don't think there will ever be a built-in function for retrieving
plain text from HTML, because HTML evolves every day and, as you
pointed out, the code snippet I sent does not take CSS into account:
it was written three years ago :-(
If you enhance it, please share ;-)
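For what it's worth, here is a minimal sketch of one possible
enhancement. It assumes the leftover CSS comes from <style> tags that
carry attributes (for example type="text/css"), which the literal
"<STYLE>" pattern in the snippet quoted below does not match.
StripStyles is a hypothetical helper name, not part of the original
code, and the address in the example call is a placeholder:

```livecode
function StripStyles pHtml -- hypothetical helper, not in the original snippet
   -- [^>]* allows attributes inside the opening tag;
   -- (?Usi) = ungreedy, dot matches newlines, case-insensitive
   return replaceText(pHtml,"(?Usi)<style[^>]*>.*</style>","")
end StripStyles

on mouseUp -- example call, e.g. in a button script
   local thePage
   put URL "http://www.example.com/" into thePage -- placeholder address
   put StripTags(StripStyles(thePage)) into field "The Page"
end mouseUp
```

This only covers embedded style blocks. Rules in a linked stylesheet
live in a separate file, so they are not part of the page text to
begin with.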
The way that Jacque was talking about is interesting when you want to
parse the text 'invisibly', but it is not satisfactory for displaying it.
Unfortunately, because it would be so simple :-)

Good luck, and prefer knitting code to sweaters :-)

On 2 Aug 08, at 17:19, H Baric wrote:

> Well, you know I could have thought of that!
> So simple and obvious really isn't it!
> I mean, I could have just asked my two year old instead!
>
> :-o
> :-|
>
> Well, I was going to just take myself to bed when I saw all that code,
> but at least I could understand it, and so decided to just try it out...
>
> And it works except - all the CSS remains! (Anyone ever heard of linked
> stylesheets sheesh!)
>
> So rather than add a million more lines to the script (would it ever be
> complete!), I'm thinking I shall give up for now, at least until
> tomorrow when I am well slept, and can think up nice little
> uncomplicated things to create for the purpose of keeping the old brain
> cells alive.
>
> Thanks for your help again Eric.
>
> Heather, who is determined to be a programmer when she grows up.
> At 36yrs though, she is wondering if she should just stick to knitting.
> on knitOne ; select chunk of wool ; tie it in a knot ; create noose ;
> end knitOne
>
> ----- Original Message -----
> From: "Eric Chatonet" <eric.chatonet at sosmartsoftware.com>
> To: "How to use Revolution" <use-revolution at lists.runrev.com>
> Sent: Sunday, August 03, 2008 12:33 AM
> Subject: Re: Getting the text content of a HTML page
>
>
> Re,
>
> On 2 Aug 08, at 16:31, H Baric wrote:
>
>> * Get the text only from a web page - no html tags, no formatting etc.
>
> LOL
> This is a case that needs an additional code snippet, as I said in a
> previous email :-)
>
> put StripTags(thePage) into field "The Page"
> ---------------------------------------------------------
> function StripTags pHtml -- returns the meaningful text from a web page
>    local tRegex,tPrevText
>    constant kHtml = "&eacute;,&agrave;,&ccedil;,&gt;,&lt;,&ecirc;,&egrave;,&copy;,&bull;,&#39;,&middot;,&amp;"
>    constant kConvertedHtml = "é,à,ç,>,<,ê,è,©,•,',·,&"
>    -----
>    replace return with space in pHtml
>    replace numtochar(13) with empty in pHtml
>    replace tab with empty in pHtml
>    -----
>    put replacetext(pHtml,"(?Usi)<SCRIPT.*</SCRIPT>","") into pHtml
>    put replacetext(pHtml,"(?Usi)<STYLE>.*</STYLE>","") into pHtml
>    put replacetext(pHtml,"(?Usi)<\?.*\?>","") into pHtml
>    -----
>    replace "&nbsp;" with space in pHtml
>    replace "<BR>" with return in pHtml
>    replace "<p>" with return in pHtml
>    -----
>    put  "<[^><]*>" into tRegex
>    put replacetext(pHtml,tRegex,"") into pHtml
>    put replacetext(pHtml,tRegex,"") into pHtml
>    -----
>    repeat until tPrevText is pHtml
>      put pHtml into tPrevText
>      put replacetext(pHtml," +",space) into pHtml
>      put replacetext(pHtml,"^ ","") into pHtml
>    end repeat
>    -----
>    replace (space & return) with return in pHtml
>    replace (return & space) with return in pHtml
>    filter pHtml without empty
>    -----
>    replace "&quot;" with quote in pHtml
>    repeat with i = 1 to the number of items of kHtml
>      replace item i of kHtml with item i of kConvertedHtml in pHtml
>    end repeat
>    -----
>    return pHtml
> end StripTags
>
> Best regards from Paris,
> Eric Chatonet.



Best regards from Paris,
Eric Chatonet.
----------------------------------------------------------------
Plugins and tutorials for Revolution: http://www.sosmartsoftware.com/
Email: eric.chatonet at sosmartsoftware.com
----------------------------------------------------------------




