Speeding up get URL

Jim Ault JimAultWins at yahoo.com
Sun Aug 3 12:20:54 EDT 2008


The major limitation in your case is that each request sent to a web server
is gated by that server's response time.  Some servers intentionally delay
their responses to control bandwidth demands and load balancing, especially
if some of their hosted customers are downloading videos or games or music
or flash files.

You could add a timer to your process and find out which sites return the
slowest, but the results may not hold every time you access the same site.
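
A rough, untested sketch of that kind of timing pass might look like this
(tPage, tTimingLog, and the parsing step are just placeholders):

   # log how long each "get url" takes; tTimingLog ends up tab-delimited: URL, milliseconds
   repeat for each line tUrl in toDoList
      put the milliseconds into tStart
      put url tUrl into tPage
      put tUrl & tab & (the milliseconds - tStart) & return after tTimingLog
      # ...parse tPage here...
   end repeat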

The rate of 10,000 per hour sounds about right (roughly 2.8 pages per second,
or about 360 ms per request), since most product pages on the internet are
served from a database and are not static pages.

One service provider that I extract data from does not want more than one
hit every 50 seconds so that it can keep serving hundreds of simultaneous
users; they protect themselves from "denial of service" attacks that
overload their machines.
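
A throttle for that kind of limit can be as simple as this rough, untested
sketch (the 50-second figure, tProviderUrls, and the parsing step are
placeholders):

   # wait until at least 50 seconds have passed since the last hit to this provider
   put 0 into tLastHit
   repeat for each line tUrl in tProviderUrls
      if the seconds - tLastHit < 50 then wait (50 - (the seconds - tLastHit)) seconds
      put the seconds into tLastHit
      put url tUrl into tPage
      # ...parse tPage here...
   end repeat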

One of my hosting companies does not charge extra for bandwidth, but has set
the load balancing so that I could not serve videos effectively.  Works
great for blogs and low-volume pages.  Not good for music and games.

I would just let the app run overnight since you are only doing it a couple
times a month.  Of course you could run the app on two computers to double
the speed, then merge the results.

Another step might be to see if the sites have a listing page with your
desired data.  That might give you 10-50 products on one page rather than
spread across 10-50 different web pages.

Just a thought...  One factor might be that if your list has the same domain
appearing as a contiguous block, the web server may detect that you are not
a human browsing and slow down the transfer rate.  One of my hosting
companies does this because they had bad experiences with denial of service
attacks.
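
One untested way to ease off in that situation is to pause briefly whenever
two consecutive URLs share a domain (the 500 ms figure is arbitrary, and the
fetch/parse step is a placeholder):

   # back off when the next URL hits the same domain as the previous one
   set the itemDelimiter to "/"
   put empty into tLastDomain
   repeat for each line tUrl in toDoList
      put item 3 of tUrl into tDomain  # "www.somewhere.com" from http://www.somewhere.com/somePage
      if tDomain is tLastDomain then wait 500 milliseconds
      put tDomain into tLastDomain
      # ...fetch and parse tUrl here...
   end repeat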

Hope this helps.

Jim Ault
Las Vegas


On 8/3/08 7:35 AM, "Shari" <shari at gypsyware.com> wrote:

> Goal:  Get a long list of website URLs, parse a bunch of data from
> each page, if successful delete the URL from the list, if not put the
> URL on a different list.  I've got it working but it's slow.  It
> takes about an hour per 10,000 urls.  I sell tshirts.  Am using this
> to create informational files for myself which will be frequently
> updated.  I'll probably be running this a couple times a month and
> expect my product line to just keep on growing.  I'm currently at
> about 40,000 products but look forward to the day of hundreds of
> thousands :-)  So speed is my need... (Yes, if you're interested my
> store is in the signature, opened it last December :-)
> 
> How do I speed this up?
> 
> # toDoList needs to have the successful URLs deleted, and failed URLs
> # moved to a different list
> # that's why p down to 1, for the delete
> # URLs are standard http://www.somewhere.com/somePage
> 
>    repeat with p = the number of lines of toDoList down to 1
>       put url (line p of toDoList) into tUrl
>       # don't want to use *it* because there's a long script that follows
>       # *it* is too easily changed, though I've heard *it* is faster than *put*
>       # do the stuff
>       if doTheStuffWorked then
>          delete line p of toDoList
>       else put line p of toDoList & return after failedList
>       updateProgressBar # another slowdown but necessary, gives a count of how many left to do
>    end repeat
