text parsing question

Jim Ault JimAultWins at yahoo.com
Fri Jun 9 22:39:53 EDT 2006


On 6/9/06 6:23 PM, "Josh Mellicker" <josh at dvcreators.net> wrote:

> Thanks for all the suggestions, I have learned a lot!

The one caution I would give about parsing HTML is that you need to consider
where it comes from and how many variations of 'legal' code there are.  As a
rule, extra white space is ignored by the browser; it is there for the
programmer, not the viewer.

Also, malformed and extra tags are a frequent occurrence, especially in
database-driven pages generated with PHP or ASP.

This makes RegEx a bit problematic, although I use it extensively.  Eric
Chatonet has made a very good tag cleaner, but it will likely need to be
tweaked for anyone's particular case.

Assume that whatever text string you are searching for can also occur in the
content, a file name, JavaScript, or meta tags.  This means a certain degree
of anarchy, but running some basic cleanup routines first will help
immensely.
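
As a rough illustration of that kind of cleanup pass, here is a minimal
sketch in Python (used only for illustration).  The function name
remove_noise and the three patterns are assumptions, not a tested cleaner;
it simply drops script blocks, comments, and meta tags so that a later search
only sees the visible markup.

```python
import re

# Hypothetical patterns for illustration; real pages may need more cases.
_SCRIPT = re.compile(r"<script\b.*?</script\s*>", re.IGNORECASE | re.DOTALL)
_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
_META = re.compile(r"<meta\b[^>]*>", re.IGNORECASE)


def remove_noise(html):
    """Drop script blocks, comments, and meta tags before searching."""
    for pattern in (_SCRIPT, _COMMENT, _META):
        html = pattern.sub("", html)
    return html


page = ('<meta name="keywords" content="price">'
        '<script>var price = 1;</script>'
        '<!-- price --><p>price: 10</p>')
print(remove_noise(page))
```

With the noise removed, the word "price" appears only once, in the content
that was actually being looked for.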

The reason my example used cr to make each part land on its own line is that
I found this to be the most useful way to debug what was happening in my
steps.  Also, a cr means nothing to an HTML parser, so they can all be
stripped at the beginning.  This is beneficial if for some reason the string
you are looking for happens to span two 'lines' (as we think of them).
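
A minimal Python sketch of that first step, assuming the page has no <pre>
or script content where line breaks matter (the function name
strip_line_breaks is made up for the example):

```python
def strip_line_breaks(html):
    """Remove CR and LF so a search string cannot be split across lines."""
    return html.replace("\r", "").replace("\n", "")


page = "<TABLE><TR><TD>cell</TD>\r\n</TR></TABLE>"
flat = strip_line_breaks(page)

print("</TD></TR>" in page)  # False: the line break splits the string
print("</TD></TR>" in flat)  # True
```

Putting a cr back in after each tag for debugging can then be done as a
separate, deliberate step.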

For example, all four of the following do the same thing in most browsers:

</TD></TR>

  </TD>
</TR>

    </  TD>
             </      TR>


    </  
TD>
</      TR>
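
One way to make a search treat those variations alike is to normalize the
white space in and between tags first.  The following Python sketch is a
crude illustration under stated assumptions: normalize_tags is a hypothetical
name, the input is assumed to be markup only (no script code or text with
bare "<" or ">" characters), and white space between tags is assumed not to
matter for the search.

```python
import re

_AFTER_OPEN = re.compile(r"<(/?)\s+")   # white space right after "<" or "</"
_BEFORE_CLOSE = re.compile(r"\s+>")     # white space right before ">"
_BETWEEN_TAGS = re.compile(r">\s+<")    # white space between adjacent tags


def normalize_tags(html):
    """Collapse white space inside and between tags, then trim the ends."""
    html = _AFTER_OPEN.sub(r"<\1", html)
    html = _BEFORE_CLOSE.sub(">", html)
    html = _BETWEEN_TAGS.sub("><", html)
    return html.strip()


variants = [
    "</TD></TR>",
    "  </TD>\n</TR>",
    "    </  TD>\n             </      TR>",
    "    </  \nTD>\n</      TR>",
]
for v in variants:
    print(normalize_tags(v))
```

Each of the four variants comes out as </TD></TR>, so a single plain-text
search covers all of them.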

Jim Ault 
Las Vegas 




