Speeding up get URL

Alex Tweedly alex.tweedly at gmail.com
Mon Aug 4 08:00:28 EDT 2008


Sarah Reichelt wrote:
> On Mon, Aug 4, 2008 at 12:35 AM, Shari <shari at gypsyware.com> wrote:
>   
>> Goal:  Get a long list of website URLS, parse a bunch of data from each
>> page, if successful delete the URL from the list, if not put the URL on a
>> different list.  I've got it working but it's slow.  It takes about an hour
>> per 10,000 urls.  I sell tshirts.  Am using this to create informational
>> files for myself which will be frequently updated.  I'll probably be running
>> this a couple times a month and expect my product line to just keep on
>> growing.  I'm currently at about 40,000 products but look forward to the day
>> of hundreds of thousands :-)  So speed is my need... (Yes, if you're
>> interested my store is in the signature, opened it last December :-)
>>
>> How do I speed this up?
>>     
>
> Shari, I think the delay will be due to the connection to the server,
> not your script, so there may not be a lot you can do about it.
>
> I did have one idea: can you try getting more than one URL at the same
> time? If you build a list of the URLs to check, then have a script
> that grabs the first one on the list, and sends a non-blocking request
> to that site, with a message to call when the data has all arrived.
> While waiting, start loading the next site and so on. Bookmark
> checking software seems to work like this.
>
>   
You should be able to achieve that using 'load URL': set off a number 
of 'load's, then check the URLStatus to process each one as it finishes 
arriving on your machine; and as the number of outstanding requests 
decreases, set off the next batch of 'load's.
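
Something along these lines would do it - completely untested, and the 
handler names and the kMaxLoads constant are just placeholders - keeping 
a fixed number of 'load's in flight and processing each page as its 
callback message arrives:

    constant kMaxLoads = 10
    local sToDo, sInFlight, sFailedList

    on startDownloads pToDoList
       put pToDoList into sToDo
       put empty into sInFlight
       put empty into sFailedList
       fillQueue
    end startDownloads

    on fillQueue
       repeat while the number of lines of sInFlight < kMaxLoads \
             and sToDo is not empty
          put line 1 of sToDo into tURL
          delete line 1 of sToDo
          put tURL & return after sInFlight
          load URL tURL with message "pageArrived"
       end repeat
    end fillQueue

    # the callback gets the URL and its final URLStatus as parameters
    on pageArrived pURL, pStatus
       if pStatus is "cached" then
          put URL pURL into tPage   # served from the cache, no new request
          # doTheStuff tPage ...
       else
          put pURL & return after sFailedList   # "error" or "timeout"
       end if
       unload URL pURL              # free the cached copy
       delete line lineOffset(pURL, sInFlight) of sInFlight
       fillQueue                    # keep the pipeline full
    end pageArrived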

But the likelihood is that this would only make a small difference - the 
majority of the time is probably spent on server response times and/or 
on simply downloading all those bytes to your machine.  Out of interest, 
I'd be inclined to count the number of bytes transferred per URL and see 
whether that is a significant percentage of your connection capacity.
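
For what it's worth, a rough way to measure that (again untested, and 
the handler names are only placeholders) is to total the length of each 
page and the elapsed time, and compare the result with your line speed:

    local sTotalBytes, sStartTime

    on startTiming
       put 0 into sTotalBytes
       put the milliseconds into sStartTime
    end startTiming

    on countPage pPageData
       add the length of pPageData to sTotalBytes
    end countPage

    on reportThroughput
       put (the milliseconds - sStartTime) / 1000 into tSecs
       if tSecs = 0 then exit reportThroughput
       put sTotalBytes / tSecs / 1024 into tKBytesPerSec
       answer sTotalBytes && "bytes in" && tSecs && "seconds =" && \
             round(tKBytesPerSec) && "KB/sec"
    end reportThroughput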

Are you running these from a machine behind a (relatively) slow Internet 
connection, such as a DSL or cable modem?
If so, you might get a big improvement by converting the script into a 
CGI script and running it on your own web-hosting server; that would 
give you an effective bandwidth based on the ISP's connection rather 
than on a slow DSL-like one. (I have a vaguely similar script I run from 
my site that is approximately 1000 times faster than running it from 
home on 8Mb/s DSL - the lower latency helps as much as the increased 
bandwidth.) But beware - if there are any issues with looking like a DoS 
attack, or with sending too many requests per second, this approach 
might be much more likely to trigger them; you may also run into issues 
with CPU and/or bandwidth usage on your hosting ISP.
> Would opening a socket and reading from the socket be any faster? I
> don't imagine that it would be, but it might be worth checking.
>
> The other option is just to adjust things so it is not intrusive e.g.
> have it download the sites overnight and save them all for processing
> when you are ready, or have a background app that does the downloading
> slowly (so it doesn't overload your system).
>   
On that same idea, but taking it further (maybe too far): how 
absolutely up to date does the info need to be when you run the script?
Could you process a few thousand URLs per night, caching either the 
pages as local files or just the data extracted from them?  Then when 
you want to run your script, you use all the cached data - so some of 
it is right up to date, while other parts may be up to a few days old.  
You may also know, or be able to find out, which of the URLs tend to 
change frequently, and bias the background processing accordingly.
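
As a (completely untested) sketch of the page-caching part - the folder 
path and handler names are only placeholders, and the cache folder is 
assumed to already exist:

    constant kCacheFolder = "/Users/shari/urlCache/"

    function cacheFileFor pURL
       # crude but adequate: turn the URL into a safe file name
       replace "/" with "_" in pURL
       replace ":" with "_" in pURL
       return kCacheFolder & pURL & ".html"
    end cacheFileFor

    on refreshBatch pURLList, pHowMany
       put min(pHowMany, the number of lines of pURLList) into tCount
       repeat with i = 1 to tCount
          put line i of pURLList into tURL
          put URL tURL into tPage        # blocking fetch is fine overnight
          if the result is empty then
             put tPage into URL ("binfile:" & cacheFileFor(tURL))
          end if
       end repeat
    end refreshBatch

    # later, when the real script runs, prefer the cached copy if it exists:
    #    if there is a file cacheFileFor(tURL) then
    #       put URL ("binfile:" & cacheFileFor(tURL)) into tPage
    #    else
    #       put URL tURL into tPage
    #    end if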



And, finally, a couple of trivial issues .....

>
> # toDoList needs to have the successful URLs deleted, and failed URLs 
> moved to a different list
> # that's why p down to 1, for the delete
> # URLS are standard http://www.somewhere.com/somePage
>
>   repeat with p = the number of lines of toDoList down to 1
>       put url (line p of toDoList) into tUrl
>       # don't want to use *it* because there's a long script that follows
>       # *it* is too easily changed, though I've heard *it* is faster 
> than *put*
>       # do the stuff
>       if doTheStuffWorked then
>          delete line p of toDoList
>       else put p & return after failedList
>       updateProgressBar # another slowdown but necessary, gives a 
> count of how many left to do
>    end repeat 
I don't fully understand this (??). What you describe is doing BOTH - 
deleting the successful ones AND saving the failed ones - so at the 
end, toDoList should finish up the same as failedList. But what your 
pseudo-code actually does is save the *indexes* of the failed URLs, and 
those indexes become invalid once you delete lower-numbered lines; I 
think you intended to do
     else put (line p of toDoList) & return after failedList

If you are saving the failedList, then there is no need to modify the 
toDoList - so I'd simply change the loop to be
    repeat for each line tURLName of toDoList
        put url tURLName into tURL
        .....
        else put tURLName & return after failedList
    etc.

Of course, this isn't going to make any noticeable difference to the 
processing speed, but I think it's worth changing just to make it 
cleaner and easier to understand / maintain.

Similarly, updating the progress bar is unlikely to be significant 
compared to downloading the URLs, but you could do something to 
minimize it - either update it only every second (or every few 
seconds), or only every 100 (1000?) URLs processed, etc.  I personally 
really, really like to see the estimated time left as well as the 
number left to do - even a not-very-good estimate is better than my 
mental arithmetic :-)
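
For example (untested, and the field name is just a placeholder), you 
could throttle the display to once a second and derive a rough estimate 
from the elapsed time so far:

    local sStartTime, sLastUpdate, sTotalCount

    on initProgress pTotal
       put pTotal into sTotalCount
       put the seconds into sStartTime
       put 0 into sLastUpdate
    end initProgress

    on maybeUpdateProgress pDoneCount
       if the seconds - sLastUpdate < 1 then exit maybeUpdateProgress
       put the seconds into sLastUpdate
       if pDoneCount = 0 then exit maybeUpdateProgress
       put the seconds - sStartTime into tElapsed
       put tElapsed / pDoneCount * (sTotalCount - pDoneCount) into tSecsLeft
       put (sTotalCount - pDoneCount) && "left, about" && \
             round(tSecsLeft / 60) && "minutes to go" into field "progress"
    end maybeUpdateProgress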

-- Alex.


