Getting the text content of a HTML page

Eric Chatonet eric.chatonet at
Sat Aug 2 10:33:05 EDT 2008


On 2 Aug 08, at 16:31, H Baric wrote:

> * Get the text only from a web page - no html tags, no formatting etc.

This case needs an additional code snippet, as I said in a  
previous email :-)

put StripTags(thePage) into field "The Page"
function StripTags pHtml -- returns the meaningful text from a web page
   local tRegex,tPrevText
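   -- entity list rebuilt from kConvertedHtml; the apostrophe entity (&rsquo;) is an assumption
   constant kHtml = "&eacute;,&agrave;,&ccedil;,&gt;,&lt;,&ecirc;,&egrave;,&copy;,&bull;,&rsquo;,&middot;,&amp;"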
   constant kConvertedHtml = "é,à,ç,>,<,ê,è,©,•,',·,&"
   replace return with space in pHtml
   replace numtochar(13) with empty in pHtml
   replace tab with empty in pHtml
   put replacetext(pHtml,"(?Usi)<SCRIPT.*</SCRIPT>","") into pHtml
   put replacetext(pHtml,"(?Usi)<STYLE.*</STYLE>","") into pHtml -- also matches <style type="...">
   put replacetext(pHtml,"(?Usi)<\?.*\?>","") into pHtml
   replace "&nbsp;" with space in pHtml
   replace "<BR>" with return in pHtml
   replace "<p>" with return in pHtml
   put  "<[^><]*>" into tRegex
   put replacetext(pHtml,tRegex,"") into pHtml
   put replacetext(pHtml,tRegex,"") into pHtml
   repeat until tPrevText is pHtml
     put pHtml into tPrevText
     put replacetext(pHtml," +",space) into pHtml
     put replacetext(pHtml,"^ ","") into pHtml
   end repeat
   replace (space & return) with return in pHtml
   replace (return & space) with return in pHtml
   filter pHtml without empty
   replace "&quot;" with quote in pHtml
   repeat with i = 1 to the number of items of kHtml
     replace item i of kHtml with item i of kConvertedHtml in pHtml
   end repeat
   return pHtml
end StripTags
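
For illustration, here is a minimal sketch of how the function might be called on a downloaded page. The button handler and the URL are placeholders and not part of the original snippet; the field name "The Page" is taken from the calling line above.

```livecode
on mouseUp
   local tPage
   -- placeholder URL, replace with the page you want to read
   put url "http://www.example.com/" into tPage
   if the result is not empty then
      -- the result holds an error description when the download fails
      answer "Could not load the page:" && the result
   else
      put StripTags(tPage) into field "The Page"
   end if
end mouseUp
```

Checking the result before calling StripTags avoids filling the field with nothing when the download fails.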

Best regards from Paris,
Eric Chatonet.
Plugins and tutorials for Revolution:
Email: eric.chatonet at
