Speeding up get URL
alex at tweedly.net
Mon Aug 4 08:13:18 EDT 2008
Sorry if this message comes through twice - the first attempt might have
failed, so I'm resending from a different account.
Sarah Reichelt wrote:
> On Mon, Aug 4, 2008 at 12:35 AM, Shari <shari at gypsyware.com> wrote:
>> Goal: Get a long list of website URLS, parse a bunch of data from each
>> page, if successful delete the URL from the list, if not put the URL on a
>> different list. I've got it working but it's slow. It takes about an hour
>> per 10,000 urls. I sell tshirts. Am using this to create informational
>> files for myself which will be frequently updated. I'll probably be running
>> this a couple times a month and expect my product line to just keep on
>> growing. I'm currently at about 40,000 products but look forward to the day
>> of hundreds of thousands :-) So speed is my need... (Yes, if you're
>> interested my store is in the signature, opened it last December :-)
>> How do I speed this up?
> Shari, I think the delay will be due to the connection to the server,
> not your script, so there may not be a lot you can do about it.
> I did have one idea: can you try getting more than one URL at the same
> time? If you build a list of the URLs to check, then have a script
> that grabs the first one on the list, and sends a non-blocking request
> to that site, with a message to call when the data has all arrived.
> While waiting, start loading the next site and so on. Bookmark
> checking software seems to work like this.
You should be able to achieve that using 'load URL' - set a number of
'load's going, then check the URLStatus so you can process each page
as it finishes arriving at your machine; and as the number of
outstanding requests decreases, set off the next batch of 'load's.
But the likelihood is that this would only make a small difference - the
majority of the time is probably due to the server response times
and/or the delay in simply downloading all those bytes to your machine.
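
For what it's worth, the 'load URL' approach described above might look
roughly like the sketch below. It is only an outline under some
assumptions: the handler names (startChecking, startMoreLoads,
pageArrived), the script-local variables and the batch size of 5 are
all made up for illustration, and the list is assumed to contain one
URL per line with no blank lines.

```livecode
local sQueue, sActive, sFailedList

constant kMaxActive = 5   # assumed batch size, tune to taste

on startChecking pToDoList
   put pToDoList into sQueue
   put 0 into sActive
   put empty into sFailedList
   startMoreLoads
end startChecking

# keep up to kMaxActive non-blocking downloads in flight
on startMoreLoads
   repeat while sActive < kMaxActive and sQueue is not empty
      put line 1 of sQueue into tUrl
      delete line 1 of sQueue
      add 1 to sActive
      load URL tUrl with message "pageArrived"
   end repeat
end startMoreLoads

# sent by 'load' when a download finishes or fails
on pageArrived pUrl, pStatus
   subtract 1 from sActive
   if pStatus is "cached" then
      put URL pUrl into tData   # comes from the cache, not a second download
      # do the stuff with tData here
   else
      put pUrl & return after sFailedList
   end if
   unload URL pUrl   # release the cached copy
   startMoreLoads
end pageArrived
```

This version uses the callback message rather than polling the
URLStatus, which amounts to the same thing; either way the next 'load'
only starts once an earlier one has completed.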
Out of interest I'd be inclined to count the number of bytes transferred
per URL and see if that is a significant percentage of your connection
bandwidth.
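
A quick way to collect those numbers might be something like the
following. The function name measureFetch is hypothetical, and length()
is assumed to count one character per byte for the downloaded data.

```livecode
# returns "<bytes><tab><milliseconds>" for one blocking download
function measureFetch pUrl
   put the milliseconds into tStart
   put URL pUrl into tData
   put the milliseconds - tStart into tElapsedMs
   return length(tData) & tab & tElapsedMs
end measureFetch
```

As a rough yardstick: an hour per 10,000 URLs is 0.36 seconds per URL.
Assuming, purely for illustration, an 8 Mbit/s line (about 1,000,000
bytes per second), that is time enough for roughly 360,000 bytes. If
the pages are much smaller than that, the bytes themselves are not the
bottleneck and latency or server response time is.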
Are you running these from a machine behind a (relatively) slow Internet
connection, such as a DSL or cable modem ?
If so, you might get a big improvement by converting the script into a
CGI script, and running it on your own web-hosting server; that would
give you an effective bandwidth based on the ISP, rather than on a slow
DSL-like connection. (I have a vaguely similar script I run from my site
that is approx 1000 times faster than running it from home on an 8 Mbit/s
DSL - the lower latency helps as much as the increased bandwidth). But
beware - if there are any issues with looking like a DoS attack, or
sending too many requests per second, this might be much more likely to
trigger them; you may also run into issues with usage of CPU and/or
bandwidth on your hosting-ISP.
> Would opening a socket and reading from the socket be any faster? I
> don't imagine that it would be, but it might be worth checking.
> The other option is just to adjust things so it is not intrusive e.g.
> have it download the sites overnight and save them all for processing
> when you are ready, or have a background app that does the downloading
> slowly (so it doesn't overload your system).
On that same idea, but taking it further (maybe too far) - how
absolutely up-to-date does the info need to be when you run the script ?
Could you process a few thousand URLs per night, caching either the URLs
as files locally, or caching the extracted data from them. Then when you
want to run your script, you use all the cached data - so some of it is
right up to date, while other parts may be up to a few days old. You
may also know, or be able to find out, which of the URLs tend to change
frequently, and therefore bias the background processing accordingly.
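
A minimal sketch of that caching idea, assuming the pages are simply
saved as files in a folder of your choosing. The names cacheBatch and
cachedPage are hypothetical, and using urlEncode to turn a URL into a
file name is just one convenient choice (very long URLs could exceed
the file-name limit on some systems).

```livecode
# background job: fetch a batch of URLs and save each page as a file
on cacheBatch pUrlList, pFolder
   if there is not a folder pFolder then create folder pFolder
   repeat for each line tUrl of pUrlList
      put URL tUrl into tData
      if the result is empty then
         # urlEncode escapes "/" and ":" so the URL can serve as a file name
         put tData into URL ("binfile:" & pFolder & "/" & urlEncode(tUrl))
      end if
   end repeat
end cacheBatch

# main script: use the saved copy if there is one, else return empty
function cachedPage pUrl, pFolder
   put pFolder & "/" & urlEncode(pUrl) into tPath
   if there is a file tPath then return URL ("binfile:" & tPath)
   return empty
end cachedPage
```

The main script would then call cachedPage first and only fall back to
a live download when it returns empty.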
And, finally, a couple of trivial issues .....
> # toDoList needs to have the successful URLs deleted, and failed URLs
> # moved to a different list
> # that's why p down to 1, for the delete
> # URLS are standard http://www.somewhere.com/somePage
> repeat with p = the number of lines of toDoList down to 1
>    put url (line p of toDoList) into tUrl
>    # don't want to use *it* because there's a long script that follows
>    # *it* is too easily changed, though I've heard *it* is faster
>    # than *put*
>    # do the stuff
>    if doTheStuffWorked then
>       delete line p of toDoList
>    else put p & return after failedList
>    updateProgressBar # another slowdown but necessary, gives a
>    # count of how many left to do
> end repeat
I don't fully understand this (??). What you describe is doing BOTH
delete the successful ones, and ALSO save the failed ones - so at the
end, toDoList should finish up the same as failedList. But what your
pseudo-code actually does is save the indexes of the failed URLs - which
become invalid once you delete lower numbered lines; I think you
intended to do
   else put (line p of toDoList) & return after failedList
If you are saving the failedList, then there is no need to modify the
toDoList - so I'd simply change the loop to be
   repeat for each line tURLName of toDoList
      put url tURLName into tURL
      # do the stuff
      if not doTheStuffWorked then put tURLName & return after failedList
      updateProgressBar
   end repeat
Of course, this isn't going to make any noticeable difference to the
processing speed, but I think it's worth changing just to make it
cleaner and easier to understand / maintain.
Similarly, the updating of the progress bar is unlikely to be
significant compared to downloading the URLs, but you could do something
to minimize it - either update it only every second (or every few
seconds), or every 100 (1000?) URLs processed, etc. I personally really,
really like to see the estimated time left as well as the number left to
do - even a not very good estimate is better than my mental arithmetic :-)
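
Something along these lines would do both, as a sketch only: the
control names (scrollbar "Progress", field "Status"), the handler
names, the two parameters and the interval of 100 are all assumptions,
not anything from the original script.

```livecode
local sStartSeconds

on startProgress
   put the seconds into sStartSeconds
end startProgress

# pDone = URLs processed so far, pTotal = URLs in the whole run
on updateProgressBar pDone, pTotal
   if pDone < 1 then exit updateProgressBar
   # only redraw every 100 URLs, plus once at the very end
   if (pDone mod 100 is not 0) and (pDone is not pTotal) then exit updateProgressBar
   set the endValue of scrollbar "Progress" to pTotal
   set the thumbPosition of scrollbar "Progress" to pDone
   # crude estimate: assume the remaining URLs take as long as the average so far
   put the seconds - sStartSeconds into tElapsed
   put round(tElapsed * (pTotal - pDone) / pDone) into tLeft
   put (pTotal - pDone) && "left, about" && tLeft && "seconds remaining" into field "Status"
end updateProgressBar
```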
Alex Tweedly mailto:alex at tweedly.net www.tweedly.net