strip html

Fri Aug 2 14:41:01 EDT 2002

The html at NY Times is a nightmare!

I use RR to read & decode many html pages, many times a day. The approach that works for me is to first isolate the part of a page that I am interested in & look only at that - then, if need be, I will throw this at it:

( Assume a global "R" which contains the html.  Note that for my application I need to preserve links as well as anchor text. )

on eatHTML
  put R into altString
  put 0 into StartChar

  put "<" into t1
  put ">" into t2
  repeat

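    -- offset() with a skip count returns a position relative to StartChar
    put offset(t1,R,StartChar) into whereStart
    if whereStart = 0 then exit repeat

    put offset(t2,R,StartChar) into whereEnd
    if whereEnd = 0 then exit repeat

    -- so add StartChar back to get the absolute positions of the whole tag
    put char StartChar+whereStart to StartChar+whereEnd of R into deHTML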

    if ( "<a" is in deHTML ) or ( "</a" is in deHTML)
    then
      -- keep anchor tags so links survive (this also keeps any other tag that starts with "<a")
    else
      put length(deHTML)-1 into tLen
      put offset(deHTML,altString) into whereAlt
      if whereAlt <> 0 then
        delete char whereAlt to whereAlt+tLen of altString
      end if
    end if

    put StartChar+whereEnd into StartChar
  end repeat

  repeat with i = the number of lines in altString down to 1
    if line i of altString is empty then delete line i of altString
  end repeat

  put altString into R
end eatHTML
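
The "isolate first" step mentioned at the top could be sketched roughly like this. It is only a sketch: isolateSection, pStartMarker and pEndMarker are made-up names, and it assumes you can find two strings that reliably bracket the part of the page you care about.

function isolateSection pHTML, pStartMarker, pEndMarker
  put offset(pStartMarker, pHTML) into tStart
  if tStart = 0 then return empty

  -- skip everything up to and including the start marker
  put tStart + length(pStartMarker) - 1 into tSkip

  -- offset() with a skip count returns a position relative to tSkip
  put offset(pEndMarker, pHTML, tSkip) into tEnd
  if tEnd = 0 then return empty

  -- the text between the two markers, markers themselves excluded
  return char tSkip + 1 to tSkip + tEnd - 1 of pHTML
end isolateSection

With hypothetical markers, usage might be: put isolateSection(R, "<!-- story -->", "<!-- /story -->") into R, followed by eatHTML on whatever is left. Both of those marker strings are placeholders, not anything a real page is known to contain.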

In most cases I've found it's not necessary to resort to this procedure, especially if it's "good" modern html that's structured with css style sheets - in which case it's just a matter of looking for the correct styles & grabbing needed info.
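
As a rough illustration of "looking for the correct styles", something along these lines may be enough. It is a sketch under assumptions: grabByClass is a made-up name, the class attribute is assumed to be written with double quotes and a single class name, and the text wanted is assumed to sit directly inside the tag with no nested tags.

function grabByClass pHTML, pClass
  put "class=" & quote & pClass & quote into tNeedle
  put offset(tNeedle, pHTML) into tHit
  if tHit = 0 then return empty

  -- find the ">" that closes the opening tag (position is relative to tHit)
  put offset(">", pHTML, tHit) into tClose
  if tClose = 0 then return empty
  put tHit + tClose into tSkip

  -- everything up to the next "<" is the text we want
  put offset("<", pHTML, tSkip) into tNext
  if tNext = 0 then return char tSkip + 1 to -1 of pHTML
  return char tSkip + 1 to tSkip + tNext - 1 of pHTML
end grabByClass

For example, grabByClass(R, "headline") would return the text of the first element whose tag carries class="headline", where "headline" is just a stand-in for whatever style name the page actually uses.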

But the NY Times, like I said, is a nightmare.

Hope this helps,
Eric
Thinker Toys, Inc.