Stripping html tags

Dave dave at looktowindward.com
Sun Nov 4 10:33:16 EST 2007


Hi,

I am having a problem getting this to compile. Do you think you could
email it to me as a stack, or post it somewhere so I can download it?

Thanks a lot
All the Best
Dave

On 3 Nov 2007, at 15:53, Jim Ault wrote:

> The 'seriously detailed stripper' was written by Eric. I made some
> adjustments for converting a web page to a formatted data set, which is
> why some special lines were added. I did not post the complete version,
> since it was a custom solution.
> Sorry about the confusion when the subject was simply 'tag stripping'.
>
> Step 1
> Your correction is not actually right.
> The purpose of this line is to add numtochar(160) before every "<td" tag:
>> replace "<td" with  numtochar(160)&"<td" in pHtml
>> should be...
>> replace "<td"  with numtochar(160)&"td>" in pHtml
> so
>  replace "<td" with  numtochar(160)&"<td" in pHtml
> is intended. Later, numtochar(160) will be replaced with a cr. In the
> full workflow, numtochar(160) is inserted for many reasons, and in the
> final stage all of these are converted to cr to create a table of the
> core data.
>
> Step 2
> Yes, emailers can morph the tags.
> Should have posted between <pre>...</pre> to avoid this.
>
> -----
>   replace "&nbsp;" with space in pHtml
>   replace "<B"&"R>" with return in pHtml --BR
>   replace "<p" &">" with return in pHtml --p tag
>   -----
> so....
> in a web page, white space and returns mean nothing to the browser,
> except for the single space. A run of spaces in an html document is
> rendered as a single space, so we spend a few lines in Transcript
> converting &nbsp; to a space char, then dealing with what the space
> characters will mean as data separators (eg a table of values). In
> this case, I wanted to convert spaces in part of a web doc to tabs,
> but other sections of the document could be discarded, so this worked
> well for my app.
>
> In addition, returns mean nothing to a web browser so they can be
> replaced with empty.
>
> Also important for me was the specific order of replacements to
> extract the data from a web page.
>
> Hope this clarifies some of the gymnastics I went through for tag
> stripping and data mining.
>
> Jim Ault
> Las Vegas
>
> On 11/3/07 1:48 AM, "FlexibleLearning at aol.com"
> <FlexibleLearning at aol.com> wrote:
>
>>
>> This is a seriously detailed stripper, Jim!
>>
>> Small error in syntax:
>>
>> replace "<td" with  numtochar(160)&"<td" in pHtml
>> should be...
>> replace "<td"  with numtochar(160)&"td>" in pHtml
>>
>> Also, a couple of lines were posted html2Txt-mangled. Could you clarify:
>>   -----
>> replace " " with space in pHtml
>> replace "
>> " with return in pHtml
>> replace "
>>
>> " with return in pHtml
>> -----
>>
>> If you post the handler as plain text, any html formatted text should
>> be correctly handled by the emailer.
>>
>>
>> /H
>>
>> -------------------------------
>> -------------------------------------------------
>> function  StripTags pHtml
>> local tRegex,tPrevText
>> -- add more entities if you wish, keeping both lists in the same order
>> constant kHtml = "&eacute;,&agrave;,&ccedil;,&gt;,&lt;,&ecirc;,&egrave;,&copy;,&bull;,&#39;,&middot;,&amp;"
>> constant kConvertedHtml = "é,à,ç,>,<,ê,è,©,•,',·,&"
>> -- using constants means you cannot accidentally
>> --    modify these values and damage the results
>> -----
>> replace numtochar(13) with empty in  pHtml
>> replace tab with empty in pHtml
>> replace "<td" with  numtochar(160)&"<td" in pHtml
>> -----
>> put  replacetext(pHtml,"(?Usi)<SCRIPT.*</SCRIPT>","") into pHtml
>> put replacetext(pHtml,"(?Usi)<STYLE>.*</STYLE>","") into  pHtml
>> put replacetext(pHtml,"(?Usi)<\?.*\?>","") into  pHtml
>> -----
>> replace "&nbsp;" with space in pHtml
>> replace "<B"&"R>" with return in pHtml --BR
>> replace "<p" &">" with return in pHtml --p tag
>> -----
>> put   "<[^><]*>" into tRegex
>> put replacetext(pHtml,tRegex,"")  into pHtml
>> put replacetext(pHtml,tRegex,"") into pHtml
>>
>>   ----- repeat replacements until there are no changes
>> repeat until tPrevText is pHtml
>> put pHtml into  tPrevText
>> put replacetext(pHtml," +",space) into  pHtml
>> put replacetext(pHtml,"^ ","") into pHtml
>> end repeat
>> -----
>> replace (space & return) with return in  pHtml
>> replace (return & space) with return in pHtml
>> filter pHtml without empty
>> replace numtochar(160) with empty in  pHtml
>> -----
>> replace "&quot;" with quote in pHtml
>> repeat with i = 1 to the number of items of  kHtml
>> replace item i of kHtml with item i of  kConvertedHtml in pHtml
>> end repeat
>> -----
>> --put pHtml into msg  --lets you see the result in the msg box
>> return  pHtml
>> end StripTags
>>
>>
>> Jim Ault
>> Las Vegas
>>
>> ------------------------------------------------
>> --------------------------------
>>
>>
>>
>>
>> _______________________________________________
>> use-revolution mailing list
>> use-revolution at lists.runrev.com
>> Please visit this url to subscribe, unsubscribe and manage your  
>> subscription
>> preferences:
>> http://lists.runrev.com/mailman/listinfo/use-revolution
>
>
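
The marker technique described in Step 1 of the quoted message can be sketched on its own, as a rough illustration and not Jim's full custom workflow. The handler name cellsToLines and the variable tMarker are hypothetical and do not appear in the thread. The sketch also assumes the page contains no literal character 160 of its own.

```livecode
-- Hypothetical helper (cellsToLines is not a name from the thread).
-- It shows only the marker-then-strip idea: mark each table cell,
-- remove the tags, then turn the markers into line breaks.
function cellsToLines pHtml
   local tMarker
   put numToChar(160) into tMarker
   -- returns carry no meaning in html, so drop them first
   replace return with empty in pHtml
   -- put the marker in front of every cell BEFORE the tags are removed
   replace "<td" with tMarker & "<td" in pHtml
   -- strip the tags; the markers survive because they sit outside them
   put replaceText(pHtml, "<[^><]*>", empty) into pHtml
   -- collapse runs of spaces to one space, as a browser would
   put replaceText(pHtml, " +", space) into pHtml
   -- final stage: every marker becomes a line break
   replace tMarker with return in pHtml
   -- drop the empty line left in front of the first cell
   filter pHtml without empty
   return pHtml
end cellsToLines
```

Called as cellsToLines("<tr><td>Alpha</td><td>Beta</td></tr>"), this should return Alpha and Beta on separate lines. Replacing "<td" with numtochar(160)&"td>" instead, as the suggested correction did, would leave a broken tag behind for the stripping regex, which is why the original line is the intended one.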



