Stripping html tags
Dave
dave at looktowindward.com
Sun Nov 4 10:33:16 EST 2007
Hi,
I am having a problem getting this to compile. Do you think you could
email it in a stack, or post it somewhere so I can download it?
Thanks a lot
All the Best
Dave
On 3 Nov 2007, at 15:53, Jim Ault wrote:
> The 'seriously detailed stripper' was written by Eric, and I made some
> adjustments for converting a web page to a formatted data set, therefore
> some special lines were added. I did not post the complete version, since
> it was a custom solution.
> Sorry about the confusion when the subject was simply 'tag stripping'
>
> Step 1
> Your correction is not actually the right way:
> The function here is to add numtochar(160) before every "<td" tag
>> replace "<td" with numtochar(160)&"<td" in pHtml
>> should be...
>> replace "<td" with numtochar(160)&"td>" in pHtml
> so
> replace "<td" with numtochar(160)&"<td" in pHtml
> is intended. Later numtochar(160) will be replaced with a cr. In the full
> workflow, numtochar(160) will occur for many reasons and in the end stage
> all of these will be converted to cr to create a table of the core data.
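A minimal sketch of the marker idea in Step 1, assuming a simple table fragment. The handler name MarkCells and the one-line tag regex are illustrative only and are not part of Jim's full workflow:

```livecode
function MarkCells pHtml
   -- put a marker character in front of every table cell tag
   replace "<td" with numToChar(160) & "<td" in pHtml
   -- remove the tags themselves; the markers survive this step
   put replaceText(pHtml, "<[^><]*>", empty) into pHtml
   -- end stage: every marker becomes a line break
   replace numToChar(160) with return in pHtml
   return pHtml
end MarkCells
```

Calling MarkCells("<tr><td>Apples</td><td>12</td></tr>") should leave Apples and 12 on separate lines, preceded by one empty line from the first marker.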
>
> Step 2
> Yes, emailers can morph the tags.
> Should have posted between <pre>...</pre> to avoid this.
>
> -----
> replace "&nbsp;" with space in pHtml
> replace "<B"&"R>" with return in pHtml --BR
> replace "<p" &">" with return in pHtml --p tag
> -----
> so....
> in a web page, white space and returns mean nothing to the browser, except
> for the single space. A run of spaces in a web html document is shown to
> the viewer as a single space, so we spend a few lines in transcript
> converting &nbsp; to a space char, then dealing with what the space
> characters will mean as data separators (eg a table of values). In this
> case, I wanted to convert spaces in part of a web doc to tabs, but other
> sections of the document could be discarded, so this worked well for my
> app.
>
> In addition, returns mean nothing to a web browser so they can be replaced
> with empty.
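The whitespace handling described above could be sketched like this. The handler name NormalizeSpaces is hypothetical, and the sketch assumes the entity being converted is &nbsp;, which is what the surrounding explanation suggests:

```livecode
function NormalizeSpaces pHtml
   -- line endings in the source are not meaningful to a browser
   replace numToChar(13) with empty in pHtml
   replace numToChar(10) with empty in pHtml
   -- turn non-breaking-space entities into ordinary spaces
   replace "&nbsp;" with space in pHtml
   -- collapse each run of spaces into a single space
   put replaceText(pHtml, " +", space) into pHtml
   return pHtml
end NormalizeSpaces
```

Replacing the line endings with a space instead of empty avoids joining two words that were separated only by a line break.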
>
> Also important for me was the specific order of replacements to extract
> the data from a web page.
>
> Hope this clarifies some of the gymnastics I went through for tag
> stripping and data mining.
>
> Jim Ault
> Las Vegas
>
> On 11/3/07 1:48 AM, "FlexibleLearning at aol.com"
> <FlexibleLearning at aol.com> wrote:
>
>>
>> This is a seriously detailed stripper, Jim!
>>
>> Small error in syntax:
>>
>> replace "<td" with numtochar(160)&"<td" in pHtml
>> should be...
>> replace "<td" with numtochar(160)&"td>" in pHtml
>>
>> Also, a couple of lines were posted html2Txt-mangled. Could you clarify:
>> -----
>> replace " " with space in pHtml
>> replace "
>> " with return in pHtml
>> replace "
>>
>> " with return in pHtml
>> -----
>>
>> If you post the handler as plain text, any html formatted text should be
>> correctly handled by the emailer.
>>
>>
>> /H
>>
>> ----------------------------------------------------------------
>> function StripTags pHtml
>> local tRegex,tPrevText
>> get ("é,à,ç")
>> get it & (",>,<,ê")
>> get it & (",è,©,")
>> get it & (",',·,&")
>> -- add more chars if you wish, then...
>> constant kHtml = it
>> constant kConvertedHtml = "é,à,ç,>,<,ê,è,©"
>> --using constants means you cannot accidentally
>> -- modify these vars and damage the results
>> -----
>> replace numtochar(13) with empty in pHtml
>> replace tab with empty in pHtml
>> replace "<td" with numtochar(160)&"<td" in pHtml
>> -----
>> put replacetext(pHtml,"(?Usi)<SCRIPT.*</SCRIPT>","") into pHtml
>> put replacetext(pHtml,"(?Usi)<STYLE>.*</STYLE>","") into pHtml
>> put replacetext(pHtml,"(?Usi)<\?.*\?>","") into pHtml
>> -----
>> replace " " with space in pHtml
>> replace "
>> " with return in pHtml
>> replace "
>>
>> " with return in pHtml
>> -----
>> put "<[^><]*>" into tRegex
>> put replacetext(pHtml,tRegex,"") into pHtml
>> put replacetext(pHtml,tRegex,"") into pHtml
>>
>> ----- repeat replacements until there are no changes
>> repeat until tPrevText is pHtml
>> put pHtml into tPrevText
>> put replacetext(pHtml," +",space) into pHtml
>> put replacetext(pHtml,"^ ","") into pHtml
>> end repeat
>> -----
>> replace (space & return) with return in pHtml
>> replace (return & space) with return in pHtml
>> filter pHtml without empty
>> replace numtochar(160) with empty in pHtml
>> -----
>> replace """ with quote in pHtml
>> repeat with i = 1 to the number of items of kHtml
>> replace item i of kHtml with item i of kConvertedHtml in pHtml
>> end repeat
>> -----
>> --put pHtml into msg --lets you see the result in the msg box
>> return pHtml
>> end StripTags
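As it appears here, kHtml and kConvertedHtml hold the same characters, which suggests the entity names were decoded somewhere in transit. Below is a minimal sketch of the parallel-list replacement that the loop performs. The entity names are an assumption rather than a recovery of the original list, and the handler name DecodeEntities is hypothetical:

```livecode
function DecodeEntities pText
   local tEntities, tChars, i
   -- assumed entity names; the list in the post did not survive intact
   put "&eacute;,&agrave;,&ccedil;,&gt;,&lt;,&ecirc;,&egrave;,&copy;" into tEntities
   put "é,à,ç,>,<,ê,è,©" into tChars
   -- item i of one list is replaced by item i of the other
   repeat with i = 1 to the number of items of tEntities
      replace item i of tEntities with item i of tChars in pText
   end repeat
   -- decode &amp; last, so that "&amp;lt;" ends up as the text "&lt;" and not "<"
   replace "&amp;" with "&" in pText
   return pText
end DecodeEntities
```

This sketch keeps the two lists in local variables, so they can be built up at run time in the same way the handler builds its list with get.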
>>
>>
>> Jim Ault
>> Las Vegas
>>
>> ----------------------------------------------------------------
>>
>>
>>
>>
>> _______________________________________________
>> use-revolution mailing list
>> use-revolution at lists.runrev.com
>> Please visit this url to subscribe, unsubscribe and manage your
>> subscription
>> preferences:
>> http://lists.runrev.com/mailman/listinfo/use-revolution