Stripping html tags

Jim Ault JimAultWins at yahoo.com
Sat Nov 3 11:53:27 EDT 2007


The 'seriously detailed stripper' was written by Eric, and I made some
adjustments for converting a web page to a formatted data set, so some
special lines were added.  I did not post the complete version, since
it was a custom solution.
Sorry about the confusion when the subject was simply 'tag stripping'.

Step 1
Your correction is not actually the right way:
The purpose of this line is to add numtochar(160) before every "<td" tag.
> replace "<td" with  numtochar(160)&"<td" in pHtml
> should be...
> replace "<td"  with numtochar(160)&"td>" in pHtml
so
 replace "<td" with  numtochar(160)&"<td" in pHtml
is intended.  Later numtochar(160) will be replaced with a cr.  In the full
workflow, numtochar(160) will occur for many reasons and in the end stage
all of these will be converted to cr to create a table of the core data.
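
To make the marker idea concrete, here is a minimal sketch of just that part of the workflow, not the full custom handler. The function name cellsToLines is made up for illustration, and the tag-stripping regex is the same one used in the posted StripTags handler.

```livecode
function cellsToLines pHtml
   -- mark the start of every table cell with numtochar(160)
   replace "<td" with numtochar(160) & "<td" in pHtml
   -- strip the tags; the markers survive because they sit outside the tags
   put replacetext(pHtml, "<[^><]*>", empty) into pHtml
   -- end stage: every marker becomes a line break
   replace numtochar(160) with cr in pHtml
   return pHtml
end cellsToLines
```

Called with something like "<tr><td>A</td><td>B</td></tr>", this should put A and B on separate lines, with an empty line in front that a later 'filter ... without empty' can remove.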

Step 2
Yes, emailers can morph the tags.
I should have posted between <pre>...</pre> to avoid this.

-----
  replace "&nbsp;" with space in pHtml
  replace "<B"&"R>" with return in pHtml --BR
  replace "<p" &">" with return in pHtml --p tag
  -----
so....
in a web page, white space and returns mean nothing to the browser, except
for the single space.  A run of spaces in an html document is shown to the
viewer as a single space, so we spend a few lines in Transcript converting
&nbsp; to a space char, then dealing with what the space characters will
mean as data separators (e.g. a table of values).  In this case, I wanted to
convert spaces in part of a web doc to tabs, but other sections of the
document could be discarded, so this worked well for my app.
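
As a rough sketch of that space handling (the function name spacesToTabs is hypothetical, and it assumes every remaining space really is a column separator, which was only true for part of my document):

```livecode
function spacesToTabs pText
   -- &nbsp; entities become ordinary spaces first
   replace "&nbsp;" with space in pText
   -- a browser shows any run of spaces as one, so collapse runs the same way
   put replacetext(pText, " +", space) into pText
   -- treat each remaining space as a column separator
   replace space with tab in pText
   return pText
end spacesToTabs
```

Values that contain spaces of their own would be split across columns, so this only suits data where that cannot happen.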

In addition, returns mean nothing to a web browser so they can be replaced
with empty.

Also important for me was the specific order of replacements to extract the
data from a web page.
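
A small sketch of why the order matters, under the assumption that only BR and P tags should become line breaks (tagsToLines is a made-up name and the two tag patterns are mine, not from the posted handler). If step 1 ran after step 2, it would wipe out the returns that step 2 had just created.

```livecode
function tagsToLines pHtml
   -- 1. returns in the source mean nothing to a browser, so remove them first
   replace numtochar(13) with empty in pHtml
   replace numtochar(10) with empty in pHtml
   -- 2. only now turn the line-breaking tags into real returns
   put replacetext(pHtml, "(?i)<br[^>]*>", return) into pHtml
   put replacetext(pHtml, "(?i)<p(\s[^>]*)?>", return) into pHtml
   -- 3. strip whatever tags are left
   put replacetext(pHtml, "<[^><]*>", empty) into pHtml
   return pHtml
end tagsToLines
```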

Hope this clarifies some of the gymnastics I went through for tag stripping
and data mining.

Jim Ault
Las Vegas

On 11/3/07 1:48 AM, "FlexibleLearning at aol.com" <FlexibleLearning at aol.com>
wrote:

> 
> This is a seriously detailed stripper, Jim!
>  
> Small error in syntax:
> 
> replace "<td" with  numtochar(160)&"<td" in pHtml
> should be...
> replace "<td"  with numtochar(160)&"td>" in pHtml
>  
> Also, a couple of lines were posted html2Txt-mangled. Could you  clarify:
>   -----
> replace " " with space in pHtml
> replace "
> " with return in pHtml
> replace "
> 
> " with return in pHtml
> -----
> 
> If you post the handler as plain text, any html formatted  text should be
> correctly handled by the emailer.
>  
>  
> /H
> 
> -------------------------------
> -------------------------------------------------
> function  StripTags pHtml
> local tRegex,tPrevText
> get   ("é,à,ç")
> get  it &  (",>,<,ê")
> get  it &  (",è,©,•")
> get  it &  (",',·,&")
> -- add more chars if you wish,  then...
> constant kHtml = it
> constant kConvertedHtml =  "é,à,ç,>,<,ê,è,©"
> --using constants means you cannot  accidentally
> --    modify these vars and damage the  results
> -----  
> replace numtochar(13) with empty in  pHtml
> replace tab with empty in pHtml
> replace "<td" with  numtochar(160)&"<td" in pHtml
> -----
> put  replacetext(pHtml,"(?Usi)<SCRIPT.*</SCRIPT>","") into pHtml
> put replacetext(pHtml,"(?Usi)<STYLE>.*</STYLE>","") into  pHtml
> put replacetext(pHtml,"(?Usi)<\?.*\?>","") into  pHtml
> -----
> replace "&nbsp;" with space in pHtml
> replace "<B"&"R>" with return in pHtml --BR
> replace "<p" &">" with return in pHtml --p tag
> -----
> put   "<[^><]*>" into tRegex
> put replacetext(pHtml,tRegex,"")  into pHtml
> put replacetext(pHtml,tRegex,"") into pHtml
>  
>   ----- repeat replacements until there are no changes
> repeat until tPrevText is pHtml
> put pHtml into  tPrevText
> put replacetext(pHtml," +",space) into  pHtml
> put replacetext(pHtml,"^ ","") into pHtml
> end repeat
> -----
> replace (space & return) with return in  pHtml
> replace (return & space) with return in pHtml
> filter pHtml without empty
> replace numtochar(160) with empty in  pHtml
> -----
> replace "&quot;" with quote in  pHtml
> repeat with i = 1 to the number of items of  kHtml
> replace item i of kHtml with item i of  kConvertedHtml in pHtml
> end repeat
> -----
> --put  pHtml into msg  --lets you see the result in the msg box
> return  pHtml
> end StripTags
> 
> 
> Jim Ault
> Las Vegas
> 
> ------------------------------------------------
> --------------------------------
> 
> 
> 
>    




