AW: Problem with loop of array elements

Thomas Fischer fischer at mail.sub.uni-goettingen.de
Wed Nov 16 05:52:47 EST 2005


Hi Jim,

thank you for the hints.

> Thomas, another technique you might try is locating and using the
> href = “www.url.com” string.
> This may not be suitable for your purpose, however.
> It also does not explain the debugger anomaly you saw.
Yes, I am still waiting for an explanation of that one.

>...
> This assumes there could be various forms of HTML code, where the 
> programmer uses returns to make it look good to the eye,
> but the browser simply ignores them.
> Thus the following (3) URL variations are identical to the browser
> ...

If these are arbitrary web pages, things can become fairly complicated. I don't know what the correct rules are for interpreting a line break in the HTML source.
I know that
<A HREF="www.bigbusiness.com/
product334.html">
as well as
<A
HREF="www.bigbusiness.com/product334.html">
are interpreted as
<A HREF="www.bigbusiness.com/product334.html">
so a line break may be interpreted as empty (inside the URL) or as a space (between the tag name and the attribute).

In my case I am looking into Google results, which are pretty standardized, and I don't want _all_ links, but only those pointing to the pages found. These tend to be the first word in quotes after "<p class=g>".
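
To make the "first word in quotes after the marker" idea concrete, here is a minimal sketch. It is written in Python rather than Transcript, only as an illustration. The function name and the sample page with its example.org URLs are made up, and the "<p class=g>" marker is simply the one mentioned above, so it may not match what any real page contains.

```python
MARKER = "<p class=g>"  # marker taken from the post; real pages may differ


def first_quoted_after_marker(html, marker=MARKER):
    """Collect the first double-quoted string that follows each marker."""
    # Drop CR/LF first so a URL broken across lines is rejoined.
    flat = html.replace("\r", "").replace("\n", "")
    # Case-insensitive search; assumes lower() keeps the length (true for ASCII).
    lowered = flat.lower()
    marker = marker.lower()
    results = []
    pos = lowered.find(marker)
    while pos != -1:
        open_q = flat.find('"', pos + len(marker))
        close_q = flat.find('"', open_q + 1) if open_q != -1 else -1
        if close_q == -1:
            break  # no complete quoted string left
        results.append(flat[open_q + 1:close_q])
        pos = lowered.find(marker, close_q + 1)
    return results


# Hypothetical result page, invented for this example.
page = ('<p class=g><a href="http://www.example.org/one.html">One</a><br>'
        'text <a href="http://cache.example/x">Cached</a>'
        '<p class=g><a href="http://www.example.org/two.html">Two</a>')
found = first_quoted_after_marker(page)
```

Links that appear between two markers (the "Cached" link in the sample) are skipped, which is the point of not collecting _all_ links.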

In the general setting one would have to gather examples of the weird things that may happen, but in any case one would have to get rid of returns and extract the <BASE HREF="..."> information if present.
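
The two preparation steps (getting rid of returns, picking up the BASE HREF) could look roughly like the following. This is a Python sketch, not Transcript, and deliberately naive: the function names and the sample page are assumptions, only double-quoted values are handled, and a tag such as <BASEFONT> would be mistaken for <BASE>.

```python
from urllib.parse import urljoin


def strip_line_breaks(html):
    """Remove CR and LF so that tags broken across lines become one line."""
    return html.replace("\r", "").replace("\n", "")


def find_base_href(html):
    """Return the double-quoted value of <BASE HREF="...">, or None."""
    # Assumes lower() keeps the string length (true for ASCII markup).
    lowered = html.lower()
    tag_start = lowered.find("<base")
    if tag_start == -1:
        return None
    tag_end = lowered.find(">", tag_start)
    if tag_end == -1:
        return None
    href_pos = lowered.find("href", tag_start, tag_end)
    if href_pos == -1:
        return None
    first_quote = html.find('"', href_pos, tag_end)
    if first_quote == -1:
        return None
    second_quote = html.find('"', first_quote + 1, tag_end)
    if second_quote == -1:
        return None
    return html[first_quote + 1:second_quote]


# Hypothetical page, invented for this example.
page = ('<html><head><BASE\r\nHREF="http://www.bigbusiness.com/shop/"></head>'
        '<A HREF="product334.html">x</A></html>')
flat = strip_line_breaks(page)
base = find_base_href(flat)
# A relative link is resolved against the base, if there is one.
absolute = urljoin(base, "product334.html") if base else "product334.html"
```

Resolving against the base is left to urljoin from the Python standard library; in Transcript one would have to concatenate the strings by hand.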

From my experience:
- no need to worry about numtochar(210) and numtochar(211), these are interpreted as characters, not as quotes
- but there may be links with no quotes at all (will work with Firefox anyway).

For bulk processing (e.g. harvesting entire web sites) I would shy away from regular expressions (unless their speed improves dramatically) and try something like

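-- theSearchResult holds the HTML source of the page
put empty into foundURLs
-- remove line feeds and carriage returns so that broken tags are rejoined
replace numToChar(10) with empty in theSearchResult
replace numToChar(13) with empty in theSearchResult
-- start a new line at every href (replace is case insensitive by default)
replace "href =" with return in theSearchResult
replace "href=" with return in theSearchResult
-- line 1 is whatever preceded the first href, so it holds no URL
delete line 1 of theSearchResult
repeat for each line myLine in theSearchResult
  put word 1 of myLine & return after foundURLs
end repeat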
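
For comparison, the same idea can be sketched in Python (not Transcript) without regular expressions, this time also covering the links without quotes mentioned above. The function name and the sample string, including the www.example.org address, are assumptions made for this example; single-quoted values are not handled.

```python
def extract_hrefs(html):
    """Collect href values, quoted or unquoted, using plain string scanning."""
    # Remove CR/LF so URLs and tags broken across lines are rejoined.
    flat = html.replace("\r", "").replace("\n", "")
    # Case-insensitive search; assumes lower() keeps the length (true for ASCII).
    lowered = flat.lower()
    n = len(flat)
    urls = []
    pos = lowered.find("href")
    while pos != -1:
        i = pos + 4
        while i < n and flat[i] == " ":      # allow "href ="
            i += 1
        if i < n and flat[i] == "=":
            i += 1
            while i < n and flat[i] == " ":  # allow "= value"
                i += 1
            if i < n and flat[i] == '"':
                # Quoted value: runs up to the closing quote.
                start = i + 1
                end = flat.find('"', start)
            else:
                # Unquoted value: runs up to a space, a tab or '>'.
                start = i
                end = i
                while end < n and flat[end] not in " \t>":
                    end += 1
            if end != -1 and end > start:
                urls.append(flat[start:end])
        pos = lowered.find("href", pos + 4)
    return urls


# Hypothetical input, invented for this example.
sample = ('<p class=g><A HREF="www.bigbusiness.com/\r\nproduct334.html">A</A> '
          '<a href = "www.url.com">B</a> '
          '<a href=www.example.org/page.html>C</a>')
links = extract_hrefs(sample)
```

The scan works with indices into one string instead of cutting it into pieces, so it makes a single pass over the page, which matters for bulk processing.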

And by the way, I think that something along these lines will be a better solution to my first problem as well, since it gets rid of the array altogether.

All the best
Thomas



