Getting the text content of a HTML page
Eric Chatonet
eric.chatonet at sosmartsoftware.com
Sat Aug 2 12:44:45 EDT 2008
Hi Heather,
So simple ;-)
I don't think there will ever be a built-in function for
retrieving plain text from html, because html evolves every day and, as
you pointed out, the code snippet I sent does not take CSS into
account: it was written three years ago :-(
If you enhance it, please share ;-)
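One possible starting point, as a minimal and untested sketch: the STYLE pattern in the snippet only matches a bare <STYLE> tag, so a block opened with attributes (for instance <style type="text/css">) may lose its tags to the generic tag stripper while its rules stay in the text. That this is what happened with your page is an assumption. StripCss is a hypothetical helper name, not part of the original snippet.

```livecode
function StripCss pHtml -- hypothetical helper: removes style blocks, rules included
  -- (?Usi): ungreedy, dot matches line breaks, case-insensitive
  -- [^>]* accepts any attributes inside the opening tag
  put replaceText(pHtml,"(?Usi)<STYLE[^>]*>.*</STYLE>","") into pHtml
  return pHtml
end StripCss
```

It would be called before the tags are stripped, for example: put StripTags(StripCss(thePage)) into field "The Page"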
The way that Jacque was talking about is interesting when you want to
parse the text 'invisibly' but is not satisfactory for displaying it.
Unfortunately, because it would be so simple :-)
Good luck, and prefer knitting code to sweaters :-)
On 2 Aug 08, at 17:19, H Baric wrote:
> Well, you know I could have thought of that!
> So simple and obvious really isn't it!
> I mean, I could have just asked my two year old instead!
>
> :-o
> :-|
>
> Well, I was going to just take myself to bed when I saw all that code,
> but at least I could understand it, and so decided to just try it out...
>
> And it works except - all the CSS remains! (Anyone ever heard of linked
> stylesheets, sheesh!)
>
> So rather than add a million more lines to the script (would it ever be
> complete!), I'm thinking I shall give up for now, at least until tomorrow
> when I am well slept, and can think up nice little uncomplicated things to
> create for the purpose of keeping the old brain cells alive.
>
> Thanks for your help again Eric.
>
> Heather, who is determined to be a programmer when she grows up.
> At 36yrs though, she is wondering if she should just stick to knitting.
> on knitOne ; select chunk of wool ; tie it in a knot ; create noose ; end knitOne
>
> ----- Original Message -----
> From: "Eric Chatonet" <eric.chatonet at sosmartsoftware.com>
> To: "How to use Revolution" <use-revolution at lists.runrev.com>
> Sent: Sunday, August 03, 2008 12:33 AM
> Subject: Re: Getting the text content of a HTML page
>
>
> Re,
>
> On 2 Aug 08, at 16:31, H Baric wrote:
>
>> * Get the text only from a web page - no html tags, no formatting
>> etc.
>
> LOL
> This is a case that needs some additional code snippet as I said in a
> previous email :-)
>
> put StripTags(thePage) into field "The Page"
> ---------------------------------------------------------
> function StripTags pHtml -- returns the meaningful text from a web page
>   local tRegex,tPrevText
>   -- HTML entities and, item for item, the characters they stand for
>   -- (the archive decoded this list; the ninth entity was lost and "&bull;" is assumed)
>   constant kHtml = "&eacute;,&agrave;,&ccedil;,&gt;,&lt;,&ecirc;,&egrave;,&copy;,&bull;,&#39;,&middot;,&amp;"
>   constant kConvertedHtml = "é,à,ç,>,<,ê,è,©,•,',·,&"
>   -----
>   -- flatten the source to a single line
>   replace return with space in pHtml
>   replace numtochar(13) with empty in pHtml
>   replace tab with empty in pHtml
>   -----
>   -- drop scripts, style blocks and processing instructions, contents included
>   -- (the STYLE pattern only matches a bare <STYLE> tag, without attributes)
>   put replacetext(pHtml,"(?Usi)<SCRIPT.*</SCRIPT>","") into pHtml
>   put replacetext(pHtml,"(?Usi)<STYLE>.*</STYLE>","") into pHtml
>   put replacetext(pHtml,"(?Usi)<\?.*\?>","") into pHtml
>   -----
>   replace "&nbsp;" with space in pHtml
>   replace "<BR>" with return in pHtml
>   replace "<p>" with return in pHtml
>   -----
>   -- strip the remaining tags; the second pass catches what the first one leaves
>   put "<[^><]*>" into tRegex
>   put replacetext(pHtml,tRegex,"") into pHtml
>   put replacetext(pHtml,tRegex,"") into pHtml
>   -----
>   -- collapse runs of spaces until nothing changes any more
>   repeat until tPrevText is pHtml
>     put pHtml into tPrevText
>     put replacetext(pHtml," +",space) into pHtml
>     put replacetext(pHtml,"^ ","") into pHtml
>   end repeat
>   -----
>   replace (space & return) with return in pHtml
>   replace (return & space) with return in pHtml
>   filter pHtml without empty
>   -----
>   -- decode entities last; "&amp;" closes the list so nothing is decoded twice
>   replace "&quot;" with quote in pHtml
>   repeat with i = 1 to the number of items of kHtml
>     replace item i of kHtml with item i of kConvertedHtml in pHtml
>   end repeat
>   -----
>   return pHtml
> end StripTags
>
> Best regards from Paris,
> Eric Chatonet.
Best regards from Paris,
Eric Chatonet.
----------------------------------------------------------------
Plugins and tutorials for Revolution: http://www.sosmartsoftware.com/
Email: eric.chatonet at sosmartsoftware.com/
----------------------------------------------------------------