How to structure HTML text (tags and attributes) for processing in LiveCode?

Peter Brigham MD pmbrig at gmail.com
Sun Jun 12 18:37:25 EDT 2011


On Jun 12, 2011, at 5:24 PM, Jim Ault wrote:

> On Jun 12, 2011, at 1:15 PM, Keith Clarke wrote:
>> I am a LiveCode novice (<1 year, so still a Rookie!). So, part of the challenge with LiveCode (and indeed, software development in general for me) is understanding the art of the possible.
> 
> 
> 
> If you have a URL, I could give some concrete examples of good steps to consider.
> The reality is that there are different ways that become easier depending on the structure of the html.
> 
> If the page is a catalog or inventory page, there are lots of repeating blocks of code.  If it is a set of tables, nested tables, or other cellular structure, then the approach should be different.
> 
> If it is mostly text blocks in <p> or <div> tags, with or without <a> links, that again can change the structure of your repeat loops.
> 
> One approach is to keep replacing the html strings with delimiters, such as tab, cr, ~, or a run of chars such as "MMMM" and then parse the resulting text block.  There are almost too many ways to do this.  The key is to choose those that fit together, rather than try to mix and match combinations of all of them.
> 
> This is usually a very confusing puzzle, since html offers so many variations and exceptions.
> 
> Hope this helps rather than add to the confusion.
> 
> Jim Ault
> Las Vegas

One technique is to replace a repeating string that sets off pieces of content you want to isolate with a single character, e.g., 

   replace "<td valign=" & quote & "top" & quote & " align=center>" with divChar in tText

then set the itemDelimiter or lineDelimiter to that character, so you can access "item n of tText" or "line n of tText".
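
As a minimal sketch of that idea (the sample HTML and the names divChar and tCell are made up for illustration, and it assumes the chosen character does not occur in the text):

   put "<td>apple</td><td>pear</td><td>plum</td>" into tText
   put numToChar(198) into divChar -- assumption: not found in tText
   -- turn each cell boundary into the delimiter, then drop the outer tags
   replace "</td><td>" with divChar in tText
   replace "<td>" with empty in tText
   replace "</td>" with empty in tText
   set the itemDelimiter to divChar
   put item 2 of tText into tCell -- tCell now holds "pear"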

You need to make sure that the characters you use as delimiters are not found in the text. For HTML, it is usually enough to pick high-ASCII characters. Here is a way of choosing them that also works with text that may itself contain high-ASCII characters, so it is more than enough for parsing HTML text.

---------

local lineChar, itemChar, cellDivChar
local textBlockDivider, frameDivider -- extras as needed

on assignDelims tText
   put getdelimiters(tText) into delimArray
   -- an array of high-ASCII characters not found in tText
   put "lineChar,itemChar,cellDivChar" into delimList
   -- add more if you want (declare them as script locals above)
   put the number of items of delimList into nbrDelimsNeeded
   repeat with i = 1 to nbrDelimsNeeded
      put delimArray[i] into tDelim
      if tDelim = empty then
         answer "Could only assign" && i-1 && "out of" && \
                 nbrDelimsNeeded & "!"
         -- in case you have high nbrDelimsNeeded
         -- or tText contains lots of unusual characters,
         -- and the list of allowable delims is very short
         exit to top
      end if
      do "put tDelim into" && item i of delimList
   end repeat
   
   -- you can also do this manually, for one-off parsing jobs
   -- should check that delimArray[n] is not empty if in doubt
   -- (declare script local variables as needed)

   -- put delimArray[4] into textBlockDivider
   -- put delimArray[5] into frameDivider
   -- etc
end assignDelims

function getdelimiters tText
   if tText = empty then return empty
   put "ßπ∆ƒµ¡™£¢∞§¶ªç≈…æ∑ø©®Ω" into charList
   -- don't know if this will show well in all email clients
   -- it's a string of high-ASCII characters
   put 0 into tCount
   repeat for each char tChar in charList
      if tChar is in tText then next repeat
      add 1 to tCount
      put tChar into delimList[tCount]
   end repeat
   return delimList
end getdelimiters

Then you can do things like:

   replace cr with empty in tText
   -- html ignores cr's, and extraneous returns may complicate things
   replace "<p>" with lineChar in tText
   put "</font></td><td valign=" & quote & "top" & quote & "><font face=" & quote & \
         "Arial" & quote & " size=" & quote & "-1" & quote & ">" into cellTag
   replace cellTag with cellDivChar in tText
   -- or whatever the tag string is for this particular table

then:

   set the lineDelimiter to lineChar
   set the itemDelimiter to cellDivChar
   repeat for each line tLine in tText
      repeat for each item textRun in tLine
         -- do more parsing stuff here:
         -- now you can work on each block textRun
      end repeat
   end repeat
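
As one possible way to fill in the "more parsing stuff", here is a rough sketch that strips any tags still left inside each block and collects the cells into tab-delimited rows. The names tResult, tRow and tClean are made up for this example, and the simple pattern "<[^>]*>" is an assumption that will not cope with every HTML oddity (a ">" inside an attribute value, for instance):

   put empty into tResult
   set the lineDelimiter to lineChar
   set the itemDelimiter to cellDivChar
   repeat for each line tLine in tText
      put empty into tRow
      repeat for each item textRun in tLine
         -- remove whatever tags remain inside this block
         put replaceText(textRun, "<[^>]*>", empty) into tClean
         put tClean & tab after tRow
      end repeat
      delete the last char of tRow -- drop the trailing tab
      put tRow & cr after tResult
   end repeat
   delete the last char of tResult -- drop the trailing cr

After that, tResult is ordinary return- and tab-delimited text, so once you set the lineDelimiter back to cr and the itemDelimiter to tab, the usual chunk expressions apply.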

This helps to pare down some of the HTML formatting/tagging so as to use LC's powerful chunk manipulation to extract the content you want.

-- Peter

Peter M. Brigham
pmbrig at gmail.com
http://home.comcast.net/~pmbrig





