AW: Problem with loop of array elements

Jim Ault JimAultWins at yahoo.com
Wed Nov 16 11:52:24 EST 2005


Yes, you are getting into the world of custom code that strips out all the
interface controls to get at the content.  Sometimes this is called
'screen-scraping'.  Several web sites use Flash or Java to prevent this,
especially for copyrighted images and graphics.

I have several custom solutions that adjust for individual site formats.

The reason for converting the curly quotes is that you were using 'quote' as
an item delimiter.  Some authors tweak their code in Microsoft Word or another
word processor and then copy and paste it, which mixes in curly quote
characters that are not char(34), the straight quote you are expecting.
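
As a rough sketch of that conversion (the function name straightenQuotes is
made up for illustration, and it assumes the text uses the Mac character
codes 210 and 211 for the curly double quotes; on Windows they are usually
147 and 148 instead):

```livecode
function straightenQuotes pText
  -- assumption: Mac Roman codes for the opening and closing curly quotes
  replace numToChar(210) with quote in pText
  replace numToChar(211) with quote in pText
  return pText
end straightenQuotes
```

Run the page text through something like this before you set the
itemDelimiter to quote, so every quoted string splits the same way.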

Hope this helps

Jim Ault
Las Vegas
-----------
On 11/16/05 2:52 AM, "Thomas Fischer" <fischer at mail.sub.uni-goettingen.de>
wrote:

> Hi Jim,
> 
> thank you for the hints.
> 
>> Thomas, another technique you might try is locating and using the href =
>> "www.url.com" string.
>> This may not be suitable for your purpose, however.
>> It also does not explain the debugger anomaly you saw.
> Yes, I am still waiting for an explanation of that one.
> 
>> ...
>> This assumes there could be various forms of HTML code, where the
>> programmer uses returns to make it look good to the eye,
>> but the browser simply ignores them.
>> Thus the following (3) URL variations are identical to the browser
>> ...
> 
> If these are arbitrary web pages, things can become fairly complicated. I
> don't know what the correct rules are for the interpretation of a line break
> in the html source.
> I know that
> <A HREF="www.bigbusiness.com/
> product334.html">
> as well as
> <A
> HREF="www.bigbusiness.com/product334.html">
> are interpreted as
> <A HREF="www.bigbusiness.com/product334.html">
> so a line break may be interpreted as empty or as space.
> 
> In my case I am looking into Google results, which are pretty standardized,
> and I don't want _all_ links, but only those to the found pages. And these
> tend to be the first word in quotes after "<p class=g>".
> 
> In the general setting one would have to gather examples of the weird things
> that may happen, but in any case one would have to get rid of returns and
> extract the <BASE HREF="..."> information if present.
> 
> From my experience:
> - no need to worry about numtochar(210) and numtochar(211), these are
> interpreted as characters, not as quotes
> - but there may be links with no quotes at all (will work with Firefox
> anyway).
> 
> For bulk processing (e.g. harvesting entire web sites) I would shy away from
> regular expressions (unless speed is improved dramatically) and try something
> like
> 
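> -- strip line breaks so a URL split across lines is rejoined
> replace numToChar(10) with empty in theSearchResult
> replace numToChar(13) with empty in theSearchResult
> -- start a new line at every href
> -- (replace is case insensitive by default)
> replace "href =" with return in theSearchResult
> replace "href=" with return in theSearchResult
> -- line 1 is whatever came before the first href, not a link
> delete line 1 of theSearchResult
> repeat for each line myLine in theSearchResult
>   put word 1 of myLine & return after foundURLs
> end repeat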
> 
> And by the way, I think that something along these lines will be a better
> solution to my first problem as well, getting rid of any array.
> 
> All the best
> Thomas
> 