How to structure HTML text (tags and attributes) for processing in LiveCode?
Jim Ault
jimaultwins at yahoo.com
Sun Jun 12 09:05:38 EDT 2011
On Jun 12, 2011, at 4:14 AM, Keith Clarke wrote:
> I've got the HTML source into a reasonable shape for processing with
> line and item chunk expressions by using:
>
> put field "fld Page Source Code" into tHTML
> replace "/div>" with "/div>" & return in tHTML
> replace "/tr>" with "/tr>" & return in tHTML
> replace "/td>" with "/td>" & tab in tHTML
> filter tHTML with <strings that isolate only the interesting, data-
> laden table rows>
>
> So, I can now have line-level chunk expressions mapped to divs and
> table row tags, together with item-level expressions for iterating
> through the tags and their attributes within table rows. Nice!
>
> Now the rich seams have been revealed, it's time to start digging
> out them there nuggets! :-)
Eric Chatonet and I used to exchange emails about screen scraping for
data mining.
One of the first operations we do is ...
--since cr's mean nothing to html
replace cr with empty in html
replace "<" with (cr & "<") in html
replace ">" with (">" & cr) in html
filter html without empty
--now all tags are on their own line
-- and the runs of text are on their own line
-- The tag attributes may not be data you care about
-- in xml, attributes are critical data storage
-- This assumes that the page does not contain commented-out blocks
-- such as
<!--
<script language="javascript">
</script>
-->
If your page does contain comment blocks like this, and they matter,
then you need to extract them as the first step.
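A minimal sketch of that first step, assuming you simply want the comment blocks removed rather than kept. The handler name stripHtmlComments and its parameter are my own, not anything built in, and it only uses the offset function and the delete command:

```livecode
-- Hypothetical helper: removes every <!-- ... --> block from pHtml.
function stripHtmlComments pHtml
   local tStart, tLen
   repeat forever
      put offset("<!--", pHtml) into tStart
      if tStart = 0 then exit repeat
      -- with a skip count, offset returns a position relative to tStart
      put offset("-->", pHtml, tStart) into tLen
      -- an unterminated comment is left alone
      if tLen = 0 then exit repeat
      -- "-->" is 3 chars long, so it ends at tStart + tLen + 2
      delete char tStart to (tStart + tLen + 2) of pHtml
   end repeat
   return pHtml
end stripHtmlComments
```

You would call it before the replace steps, for example: put stripHtmlComments(html) into html. If you want to keep the commented-out script for later inspection, copy the char range into another variable before deleting it.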
--------------------
The next consideration is html entities, such as &amp; &lt; &gt;
and the various accented vowels, etc. Do a Google search for 'html
entities' and you should be able to find convenient tables to
copy-paste the ones you care about.
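-- each item is "entity replacement"; &amp; goes last so that
-- text like "&amp;lt;" is not decoded twice
put "&lt; <, &gt; >, &amp; &" into searchStrings
repeat for each item STRR in searchStrings
   replace (word 1 of STRR) with (word 2 of STRR) in htmlBlock
end repeat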
----
If you are data mining tables, matchChunk is your friend, although you
may not care to get cozy with regex. Tables can have important info in
the header labels, and rows of data may be something you want to keep
intact.
Of course, oddities occur with table html that includes rowspan and
colspan, so those are special cases.
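For the simple case (no rowspan, colspan or nested tables), here is a rough sketch of pulling the cells out of one table row. It uses matchText, the sibling of matchChunk, because here we want the captured text rather than its character positions. The handler name rowCells is made up for illustration, and the pattern is only one possible choice:

```livecode
-- Hypothetical helper: returns the td/th contents of one row, tab-delimited.
-- Assumes pRow holds a single <tr>...</tr> with no nested tables.
function rowCells pRow
   local tCell, tRest, tCells
   -- (?si): dot matches line breaks, and tag names match in any case
   -- capture 1 = the cell contents, capture 2 = the rest of the row
   repeat while matchText(pRow, "(?si)<t[dh](?:\s[^>]*)?>(.*?)</t[dh]>(.*)", tCell, tRest)
      put tCell & tab after tCells
      -- carry on with whatever follows the cell just found
      put tRest into pRow
   end repeat
   if tCells is not empty then delete last char of tCells
   return tCells
end rowCells
```

Given a row such as "<tr><td>a</td><td>b</td></tr>", this should return a and b separated by a tab, so the usual item chunk expressions work once you set the itemDelimiter to tab. Any tags left inside a cell (links, bold and so on) come through untouched and would need a further pass.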
It is unlikely that CSS will affect your data mining and data group
relationships.
And another consideration that I haven't worried about to this moment
is multi-lingual data sources.
Hope this helps you get started on your mining exploration.
Jim Ault
Las Vegas