strip html
thinkertoys
thinkertoys at cyburb.com
Fri Aug 2 14:41:01 EDT 2002
The html at NY Times is a nightmare!
I use RR to read & decode many html pages, many times a day. The approach that works for me is to first isolate the part of the page I am interested in & look only at that. Then, if need be, I throw this at it:
( Assume a global "R" which contains the html. Note that for my application I need to preserve links as well as anchor text. )
on eatHTML
  global R -- a global must be declared in each handler that uses it
  put R into altString
  put 0 into StartChar
  put "<" into t1
  put ">" into t2
  repeat
    -- offset() results are relative to StartChar, the number of chars already scanned
    put offset(t1,R,StartChar) into whereStart
    if whereStart = 0 then exit repeat
    put offset(t2,R,StartChar) into whereEnd
    if whereEnd = 0 then exit repeat
    put char StartChar+whereStart to StartChar+whereEnd of R into deHTML
    if ( "<a" is in deHTML ) or ( "</a" is in deHTML ) then
      -- do nothing: anchor tags stay so the links are preserved
    else
      -- delete the first copy of this tag from the working string
      put length(deHTML)-1 into tLen
      put offset(deHTML,altString) into whereAlt
      if whereAlt > 0 then
        delete char whereAlt to whereAlt+tLen of altString
      end if
    end if
    put StartChar+whereEnd into StartChar
  end repeat
  -- remove the blank lines left behind by the deleted tags
  repeat with i = the number of lines in altString down to 1
    if line i of altString is empty then delete line i of altString
  end repeat
  put altString into R
end eatHTML
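The "isolate first" step mentioned above could be sketched roughly as follows. This is only an illustration: the function name isolateSection, its parameters, and the marker strings in the usage lines are made up for the example, and the right markers depend entirely on the page being read.

```livecode
-- Hypothetical helper: returns the text between two marker strings,
-- or empty if either marker is missing.
function isolateSection pHtml, pStartMarker, pEndMarker
  put offset(pStartMarker, pHtml) into tStart
  if tStart = 0 then return empty
  -- tSkip is the position of the last char of the start marker
  put tStart + length(pStartMarker) - 1 into tSkip
  -- look for the end marker only after the start marker;
  -- the result is relative to tSkip
  put offset(pEndMarker, pHtml, tSkip) into tEnd
  if tEnd = 0 then return empty
  return char tSkip + 1 to tSkip + tEnd - 1 of pHtml
end isolateSection
```

Used before the handler above, it might look like this (the marker strings are assumptions, not something the NY Times pages are known to contain):

```livecode
on cleanStory
  global R
  -- keep only the part of the page between the two assumed markers
  put isolateSection(R, "<!-- story start -->", "<!-- story end -->") into R
  eatHTML
end cleanStory
```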
In most cases I've found it's not necessary to resort to this procedure, especially if it's "good" modern html that's structured with css style sheets - in which case it's just a matter of looking for the correct styles & grabbing needed info.
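For the style-sheet case, a minimal sketch of "looking for the correct style" might be the function below. The name textOfClass and its parameters are hypothetical, and it assumes the class is written exactly as class="name" with double quotes. It returns only the text up to the next tag, so markup nested inside the element cuts the result short.

```livecode
-- Hypothetical helper: returns the text that follows the first tag
-- carrying class="pClass", up to the start of the next tag.
function textOfClass pHtml, pClass
  put "class=" & quote & pClass & quote into tNeedle
  put offset(tNeedle, pHtml) into tAt
  if tAt = 0 then return empty
  -- find the ">" that closes the opening tag (result is relative to tAt)
  put offset(">", pHtml, tAt) into tClose
  if tClose = 0 then return empty
  put tAt + tClose into tTagEnd -- absolute position of that ">"
  -- find the "<" that starts the next tag (result is relative to tTagEnd)
  put offset("<", pHtml, tTagEnd) into tNext
  if tNext = 0 then return empty
  return char tTagEnd + 1 to tTagEnd + tNext - 1 of pHtml
end textOfClass
```

With the global R from above, put textOfClass(R, "headline") into tHeadline would then pick up the first headline, assuming the page actually uses a class of that name.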
But the NY Times, like I said, is a nightmare.
Hope this helps,
Eric
Thinker Toys, Inc.