How to structure HTML text (tags and attributes) for processing in LiveCode?

Jim Ault jimaultwins at yahoo.com
Sun Jun 12 09:05:38 EDT 2011


On Jun 12, 2011, at 4:14 AM, Keith Clarke wrote:

> I've got the HTML source into a reasonable shape for processing with  
> line and item chunk expressions by using:
>
> put field "fld Page Source Code" into tHTML
> replace "/div>" with "/div>" & return in tHTML
> replace "/tr>" with "/tr>" & return in tHTML
> replace "/td>" with "/td>" & tab in tHTML
> filter tHTML with <strings that isolate only the interesting, data- 
> laden table rows>
>
> So, I can now have line-level chunk expressions mapped to divs and  
> table row tags, together with item-level expressions for iterating  
> through the tags and their attributes within table rows. Nice!
>
> Now the rich seams have been revealed, it's time to start digging  
> out them there nuggets! :-)


Eric Chatonet and I used to exchange emails about screen scraping for
data mining.
One of the first operations we do is ...
--since cr's mean nothing to html

    replace cr with empty in html
    replace "<" with (cr & "<") in html
    replace ">" with (">" & cr) in html
    filter html without empty
--now all tags are on their own line
-- and the runs of text are on their own line
-- The tag attributes may not be data you care about
--  in xml, attributes are critical data storage

--assuming that you don't have a page that contains commented lines  
such as
<!--
<script language="javascript">

</script>
-->
If your page does contain blocks like this, and they matter, then you
need to extract them as the first step.
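
As a rough sketch of that first step (my assumption of one way to do
it, not the only one), replaceText with a non-greedy pattern can drop
the comment blocks before the tags are split onto their own lines. The
function name stripHTMLComments is made up for illustration, and this
version throws the comments away rather than saving them anywhere.

```livecode
function stripHTMLComments pHTML
   -- (?s) lets "." match line breaks, so multi-line comments are caught
   -- ".*?" is non-greedy, so each comment block is removed on its own
   return replaceText(pHTML, "(?s)<!--.*?-->", empty)
end stripHTMLComments
```

You would call it before the replace steps shown earlier, for example
"put stripHTMLComments(html) into html". If you need the commented-out
script or markup as data, you would have to copy it elsewhere before
removing it.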
--------------------
The next consideration is html entities, such as &amp;  &lt;  &gt;
and various accented vowels, etc.  Do a Google search for 'html
entities' and you should be able to find convenient tables to
copy-paste the ones you care about.

-- decode &amp; last, so that text like "&amp;lt;" is not decoded twice
put "&lt; <, &gt; >, &amp; &" into searchStrings
repeat for each item STRR in searchStrings
     replace (word 1 of STRR) with (word 2 of STRR) in htmlBlock
end repeat
----
If you are data mining tables, matchChunk is your friend, though you
may not care to get cozy with regex.  Tables can have important info
in the header labels, and rows of data may be something you want to
keep intact.
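
To give a flavor of the matchChunk approach, here is a minimal sketch
that assumes you already hold the HTML of a single table row in a
variable. It returns the text of each td or th cell as one
tab-delimited line, so the row stays intact. The name rowCells is
hypothetical, and the pattern only copes with plain cells: nested
tables would confuse it.

```livecode
function rowCells pRow
   local tStart, tEnd, tCell, tCells
   -- (?si) = "." matches line breaks, and matching ignores case
   -- \b stops the pattern from matching <thead> as if it were a <th> cell
   repeat while matchChunk(pRow, "(?si)(<t[dh]\b[^>]*>.*?</t[dh]>)", tStart, tEnd)
      -- tStart and tEnd now hold the character positions of one whole cell
      put char tStart to tEnd of pRow into tCell
      -- strip the cell's own tags and any tags nested inside it
      put replaceText(tCell, "<[^>]*>", empty) into tCell
      put tCell & tab after tCells
      -- discard everything up to the end of this cell, then look again
      delete char 1 to tEnd of pRow
   end repeat
   delete last char of tCells -- remove the trailing tab
   return tCells
end rowCells
```

Calling rowCells on each row of the table gives you lines you can walk
with item chunk expressions once you set the itemDelimiter to tab.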

Of course, oddities occur with table html that includes rowspan and  
colspan, so those are special cases.
It is unlikely that CSS will affect your data mining and data group  
relationships.

And another consideration that I haven't worried about to this moment  
is multi-lingual data sources.

Hope this helps you get started on your mining exploration.

Jim Ault
Las Vegas





