Getting the text content of a HTML page
Eric Chatonet
eric.chatonet at sosmartsoftware.com
Sat Aug 2 10:33:05 EDT 2008
Hello again,
On 2 Aug 08, at 16:31, H Baric wrote:
> * Get the text only from a web page - no html tags, no formatting etc.
LOL
This is a case that needs an additional code snippet, as I said in a
previous email :-)
put StripTags(thePage) into field "The Page"
---------------------------------------------------------
function StripTags pHtml -- returns the meaningful text from a web page
local tRegex,tPrevText
constant kHtml = "&eacute;,&agrave;,&ccedil;,&gt;,&lt;,&ecirc;,&egrave;,&copy;,&#149;,&#39;,&middot;,&amp;"
constant kConvertedHtml = "é,à,ç,>,<,ê,è,©,•,',·,&"
-----
replace return with space in pHtml
replace numtochar(13) with empty in pHtml
replace tab with empty in pHtml
-----
put replacetext(pHtml,"(?Usi)<SCRIPT.*</SCRIPT>","") into pHtml
put replacetext(pHtml,"(?Usi)<STYLE.*</STYLE>","") into pHtml
put replacetext(pHtml,"(?Usi)<\?.*\?>","") into pHtml
-----
replace "&nbsp;" with space in pHtml
replace "<BR>" with return in pHtml
replace "<p>" with return in pHtml
-----
put "<[^><]*>" into tRegex
put replacetext(pHtml,tRegex,"") into pHtml
put replacetext(pHtml,tRegex,"") into pHtml
-----
repeat until tPrevText is pHtml
put pHtml into tPrevText
put replacetext(pHtml," +",space) into pHtml
put replacetext(pHtml,"^ ","") into pHtml
end repeat
-----
replace (space & return) with return in pHtml
replace (return & space) with return in pHtml
filter pHtml without empty
-----
replace "&quot;" with quote in pHtml
repeat with i = 1 to the number of items of kHtml
replace item i of kHtml with item i of kConvertedHtml in pHtml
end repeat
-----
return pHtml
end StripTags
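
For instance, here is a minimal sketch of a button script that downloads a page and displays its text. It assumes a field named "The Page" as in the line above; the URL is only a placeholder, and the error handling is just one possible approach:

```livecode
on mouseUp
   local tPage
   -- placeholder URL: replace with the page you want to read
   put url "http://www.example.com/" into tPage
   -- if the download failed, the result holds an error message
   if the result is not empty then
      answer "Could not load the page:" && the result
      exit mouseUp
   end if
   -- StripTags is the function defined above
   put StripTags(tPage) into field "The Page"
end mouseUp
```

Note that StripTags only converts the entities listed in kHtml, so you may want to extend both constants (keeping the items in the same order) for pages that use other entities.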
Best regards from Paris,
Eric Chatonet.
----------------------------------------------------------------
Plugins and tutorials for Revolution: http://www.sosmartsoftware.com/
Email: eric.chatonet at sosmartsoftware.com
----------------------------------------------------------------