Stripping html tags
Jim Ault
JimAultWins at yahoo.com
Sat Nov 3 11:53:27 EDT 2007
The 'seriously detailed stripper' was written by Eric. I made some
adjustments for converting a web page to a formatted data set, which is
why some special lines were added. I did not post the complete version,
since it was a custom solution.
Sorry about the confusion when the subject was simply 'tag stripping'.
Step 1
Your correction is not actually right.
The purpose of that line is to add numToChar(160) before every "<td" tag:
> replace "<td" with numtochar(160)&"<td" in pHtml
> should be...
> replace "<td" with numtochar(160)&"td>" in pHtml
so
replace "<td" with numtochar(160)&"<td" in pHtml
is intended. Later numtochar(160) will be replaced with a cr. In the full
workflow, numtochar(160) will occur for many reasons and in the end stage
all of these will be converted to cr to create a table of the core data.
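A minimal sketch of that marker idea, reduced to a single table row. The function name tdMarkerDemo and the sample html are made up for illustration and are not part of the posted handler; only the three steps (plant the marker, strip the tags, convert the marker to cr) come from the description above.

```livecode
-- Sketch only: tdMarkerDemo and its sample input are hypothetical.
function tdMarkerDemo
   local tHtml, tMarker
   put numToChar(160) into tMarker
   put "<tr><td>Name</td><td>Price</td></tr>" into tHtml
   -- 1. plant the marker in front of every cell tag
   replace "<td" with tMarker & "<td" in tHtml
   -- 2. strip whatever tags remain (same pattern as in StripTags)
   put replaceText(tHtml, "<[^><]*>", "") into tHtml
   -- 3. end stage: every marker becomes a line break
   replace tMarker with cr in tHtml
   return tHtml
end tdMarkerDemo
```

Because the marker goes in front of each cell, the result starts with an empty line; a later "filter ... without empty" would remove it.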
Step 2
Yes, emailers can morph the tags.
I should have posted between <pre>...</pre> to avoid this.
-----
replace "&"&"nbsp;" with space in pHtml --nbsp entity
replace "<B"&"R>" with return in pHtml --BR
replace "<p" &">" with return in pHtml --p tag
-----
so....
In a web page, white space and returns mean nothing to the browser, except
for the single space. A run of spaces in an html document is shown to the
viewer as a single space, so we spend a few lines in Transcript converting
&nbsp; to a space char, then dealing with what the space characters will
mean as data separators (eg a table of values). In this case, I wanted to
convert spaces in part of a web doc to tabs, but other sections of the
document could be discarded, so this worked well for my app.
In addition, returns mean nothing to a web browser so they can be replaced
with empty.
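The whitespace handling described above might be sketched like this. normalizeSpaces is a hypothetical helper, not a handler from the original post, and the last step assumes that every remaining space in the section of interest separates two values. The entity literal is split with & so a mailer is less likely to morph it.

```livecode
-- Sketch only: normalizeSpaces is a hypothetical helper.
function normalizeSpaces pText
   -- returns carry no meaning for the browser, so drop them
   replace numToChar(13) with empty in pText
   replace numToChar(10) with empty in pText
   -- the non-breaking space entity becomes an ordinary space
   replace "&" & "nbsp;" with space in pText
   -- a run of spaces displays as one space, so collapse each run
   put replaceText(pText, " +", space) into pText
   -- assumption: each remaining space separates two values
   replace space with tab in pText
   return pText
end normalizeSpaces
```

The order matters here: the entity has to become a real space before the runs are collapsed, otherwise mixed runs of spaces and entities would leave doubled separators.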
Also important for me was the specific order of replacements to extract the
data from a web page.
Hope this clarifies some of the gymnastics I went through for tag stripping
and data mining.
Jim Ault
Las Vegas
On 11/3/07 1:48 AM, "FlexibleLearning at aol.com" <FlexibleLearning at aol.com>
wrote:
>
> This is a seriously detailed stripper, Jim!
>
> Small error in syntax:
>
> replace "<td" with numtochar(160)&"<td" in pHtml
> should be...
> replace "<td" with numtochar(160)&"td>" in pHtml
>
> Also, a couple of lines were posted html2Txt-mangled. Could you clarify:
> -----
> replace " " with space in pHtml
> replace "
> " with return in pHtml
> replace "
>
> " with return in pHtml
> -----
>
> If you post the handler as plain text, any html formatted text should be
> correctly handled by the emailer.
>
>
> /H
>
> -------------------------------
> -------------------------------------------------
> function StripTags pHtml
> local tRegex,tPrevText
> get ("é,à,ç")
> get it & (",>,<,ê")
> get it & (",è,©,")
> get it & (",',·,&")
> -- add more chars if you wish, then...
> constant kHtml = it
> constant kConvertedHtml = "é,à,ç,>,<,ê,è,©"
> --using constants means you cannot accidentally
> -- modify these vars and damage the results
> -----
> replace numtochar(13) with empty in pHtml
> replace tab with empty in pHtml
> replace "<td" with numtochar(160)&"<td" in pHtml
> -----
> put replacetext(pHtml,"(?Usi)<SCRIPT.*</SCRIPT>","") into pHtml
> put replacetext(pHtml,"(?Usi)<STYLE.*</STYLE>","") into pHtml
> put replacetext(pHtml,"(?Usi)<\?.*\?>","") into pHtml
> -----
> replace " " with space in pHtml
> replace "
> " with return in pHtml
> replace "
>
> " with return in pHtml
> -----
> put "<[^><]*>" into tRegex
> put replacetext(pHtml,tRegex,"") into pHtml
> put replacetext(pHtml,tRegex,"") into pHtml
>
> ----- repeat replacements until there are no changes
> repeat until tPrevText is pHtml
> put pHtml into tPrevText
> put replacetext(pHtml," +",space) into pHtml
> put replacetext(pHtml,"^ ","") into pHtml
> end repeat
> -----
> replace (space & return) with return in pHtml
> replace (return & space) with return in pHtml
> filter pHtml without empty
> replace numtochar(160) with empty in pHtml
> -----
> replace "&quot;" with quote in pHtml
> repeat with i = 1 to the number of items of kHtml
> replace item i of kHtml with item i of kConvertedHtml in pHtml
> end repeat
> -----
> --put pHtml into msg --lets you see the result in the msg box
> return pHtml
> end StripTags
>
>
> Jim Ault
> Las Vegas
>
> ------------------------------------------------
> --------------------------------