How to structure HTML text (tags and attributes) for processing in LiveCode?
Peter Brigham MD
pmbrig at gmail.com
Sun Jun 12 18:37:25 EDT 2011
On Jun 12, 2011, at 5:24 PM, Jim Ault wrote:
> On Jun 12, 2011, at 1:15 PM, Keith Clarke wrote:
>> I am a LiveCode novice (<1 year, so still a Rookie!). So, part of the challenge with LiveCode (and indeed, software development in general for me) is understanding the art of the possible.
> If you have a URL, I could give some concrete examples of good steps to consider.
> The reality is that there are different ways that become easier depending on the structure of the html.
> If the page is a catalog or inventory page, there are lots of repeating blocks of code. If it is a set of tables, nested tables, or other cellular structure, then the approach should be different.
> If it is mostly text blocks in <p> or <div> tags, with or without <a> links, that again can change the structure of your repeat loops.
> One approach is to keep replacing the html strings with delimiters, such as tab, cr, ~, or a run of chars such as "MMMM", and then parse the resulting text block. There are almost too many ways to do this. The key is to pick those that fit together, rather than try to mix and match combinations of all of them.
> This is usually a very confusing puzzle, since html offers so many variations and exceptions.
> Hope this helps rather than add to the confusion.
> Jim Ault
> Las Vegas
One technique is to replace a repeating string that sets off pieces of content you want to isolate with a single character, e.g.,
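   replace ("<td valign=" & quote & "top" & quote & " align=center>") with divChar in tText
   -- a literal string can't contain double quotes, so build it with the quote constant
Then use that character as the itemDelimiter or lineDelimiter to access "item n of tText" or "line n of tText".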
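As a minimal sketch of that idea: the handler name cellsFromRow and the bare <td> tag are made up for illustration, and numToChar(1) is simply assumed not to occur in the text.

```livecode
function cellsFromRow tText
   -- hypothetical helper: returns the content of each cell, one per line
   local divChar, tCell, tResult
   put numToChar(1) into divChar -- assumption: this char is not in tText
   replace "<td>" with divChar in tText
   replace "</td>" with empty in tText
   set the itemDelimiter to divChar
   repeat for each item tCell in tText
      -- the item before the first <td> is empty, so skip it
      if tCell is empty then next repeat
      put tCell & cr after tResult
   end repeat
   delete the last char of tResult -- drop the trailing return
   return tResult
end cellsFromRow
```

With this, cellsFromRow("<td>apples</td><td>pears</td>") should give "apples" and "pears" on separate lines.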
You need to make sure that the characters you use as delimiters are not found in the text. For HTML, it is usually enough to use high-ASCII characters. The handlers below are more careful: they check each candidate character against the text, so they also work with text that itself contains high-ASCII characters, and they do fine for parsing HTML.
local lineChar, itemChar, cellDivChar
local textBlockDivider, frameDivider -- extras as needed

on assignDelims tText
   local delimArray, delimList, nbrDelimsNeeded, tDelim
   put getDelimiters(tText) into delimArray
   -- an array of high-ASCII characters not found in tText
   put "lineChar,itemChar,cellDivChar" into delimList
   -- add more names here if you want
   put the number of items of delimList into nbrDelimsNeeded
   repeat with i = 1 to nbrDelimsNeeded
      put delimArray[i] into tDelim
      if tDelim = empty then
         answer "Could only assign" && (i - 1) && "out of" && \
               nbrDelimsNeeded & "!"
         -- in case you have high nbrDelimsNeeded
         -- or tText contains lots of unusual characters,
         -- and the list of allowable delims is very short
         exit to top
      end if
      do "put tDelim into" && item i of delimList
   end repeat
   -- you can also do this manually, for one-off parsing jobs
   -- should check that delimArray[n] is not empty if in doubt
   -- (declare script local variables as needed)
   -- put delimArray[4] into textBlockDivider
   -- put delimArray[5] into frameDivider
end assignDelims

function getDelimiters tText
   -- called as a function above, so it must be declared as one
   local charList, tCount, tChar, delimList
   if tText = empty then return empty
   put "ßπ∆ƒµ¡™£¢∞§¶ªç≈…æ∑ø©®Ω" into charList
   -- don't know if this will show well in all email clients
   -- it's a string of high-ASCII characters
   put 0 into tCount
   repeat for each char tChar in charList
      if tChar is in tText then next repeat
      add 1 to tCount
      put tChar into delimList[tCount]
   end repeat
   return delimList
end getDelimiters
Then you can do things like:
   assignDelims tText
   -- fills lineChar, cellDivChar, etc. with characters not found in tText
   replace cr with empty in tText
   -- html ignores cr's, and extraneous returns may complicate things
   replace "<p>" with lineChar in tText
   put "</font></td><td valign=" & quote & "top" & quote & "><font face=" & quote & "Arial" & quote & \
         " size=" & quote & "-1" & quote & ">" into tCellTag
   -- or whatever the tag string is for this particular table
   replace tCellTag with cellDivChar in tText
   set the lineDelimiter to lineChar
   set the itemDelimiter to cellDivChar
   repeat for each line tLine in tText
      repeat for each item textRun in tLine
         -- do more parsing stuff here:
         -- now you can work on each block textRun
      end repeat
   end repeat
This helps to pare down some of the HTML formatting/tagging so as to use LC's powerful chunk manipulation to extract the content you want.
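To show the whole sequence in one place, here is a rough sketch that turns a simple table into tab-delimited text. The function name tableToTabbedText is hypothetical, numToChar(1) and numToChar(2) are assumed not to appear in the HTML, and using replaceText to strip whatever tags are left in each cell is my addition for the example, not something the steps above require.

```livecode
function tableToTabbedText tHTML
   -- hypothetical helper: one output line per table row, cells separated by tabs
   local rowChar, cellChar, tRow, tCell, tLineOut, tResult
   put numToChar(1) into rowChar -- assumption: not present in tHTML
   put numToChar(2) into cellChar -- assumption: not present in tHTML
   replace cr with empty in tHTML
   replace "</tr>" with rowChar in tHTML
   replace "</td>" with cellChar in tHTML
   set the lineDelimiter to rowChar
   set the itemDelimiter to cellChar
   repeat for each line tRow in tHTML
      put empty into tLineOut
      repeat for each item tCell in tRow
         -- remove any tags still left inside the cell
         put replaceText(tCell, "<[^>]*>", empty) into tCell
         put tCell & tab after tLineOut
      end repeat
      delete the last char of tLineOut -- drop the trailing tab
      if tLineOut is not empty then put tLineOut & cr after tResult
   end repeat
   delete the last char of tResult -- drop the trailing return
   return tResult
end tableToTabbedText
```

Real pages will need the actual tag strings from their own markup, as in the cellDivChar example above; the bare </tr> and </td> here are only the simplest case.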
Peter M. Brigham
pmbrig at gmail.com