Website scraping - How can I load a 'partial' page?
bonnmike at gmail.com
Wed Dec 13 11:23:59 EST 2017
Hmm, or use range as mentioned in my other mail.
If the server supports range requests, you can set your headers to include
Range: bytes=0-1999 to get the first 2000 bytes (the range is inclusive),
or use curl with -r 0-1999, but I have yet to find a page that will return
only a range.
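
For what it's worth, here is a rough sketch of the range idea in Python (not LiveCode, just because it is easy to try from a terminal). The names range_header and fetch_prefix are made up for this example. It asks for a byte range and also stops reading after the requested number of bytes, so it still cuts the download short when the server ignores the Range header and answers with a plain 200.

```python
import urllib.request


def range_header(n_bytes):
    """Return a Range header value covering the first n_bytes bytes."""
    if n_bytes < 1:
        raise ValueError("n_bytes must be at least 1")
    # HTTP byte ranges are inclusive, so the last index is n_bytes - 1.
    return f"bytes=0-{n_bytes - 1}"


def fetch_prefix(url, n_bytes=2000, timeout=10):
    """Fetch at most n_bytes from url and return (status, data)."""
    req = urllib.request.Request(url, headers={"Range": range_header(n_bytes)})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        # 206 means the server honoured the range; 200 means it ignored it.
        # Either way read(n_bytes) returns at most n_bytes, and leaving the
        # with-block closes the connection without reading the rest.
        return resp.status, resp.read(n_bytes)
```

Something like fetch_prefix("http://example.com/", 2000) would then hand back the status code and at most 2000 bytes; the URL is only a placeholder.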
Apparently you can find out if a page will accept ranges using curl with
something like this..
curl -I http://i.imgur.com/z4d4kWk.jpg
HTTP/1.1 200 OK
If it has "Accept-Ranges: bytes" as part of the response, it should work.
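
The same header check can be scripted. This is a minimal Python sketch, assuming a HEAD request is acceptable to the server; accepts_byte_ranges and head_accepts_byte_ranges are names invented for this example. Note that the header is only advisory: a server may honour ranges without advertising them, so a False here is a hint, not proof.

```python
import urllib.request


def accepts_byte_ranges(headers):
    """Return True if a header mapping advertises 'Accept-Ranges: bytes'."""
    for name, value in headers.items():
        # Header names are case-insensitive, so compare in lower case.
        if name.lower() == "accept-ranges":
            units = [unit.strip().lower() for unit in value.split(",")]
            return "bytes" in units
    return False


def head_accepts_byte_ranges(url, timeout=10):
    """Send a HEAD request (like curl -I) and check the response headers."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return accepts_byte_ranges(dict(resp.headers.items()))
```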
I'm still thinking the intermediary method is best.
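
To make the intermediary idea from the quoted message below concrete, here is a hedged Python stand-in for that tiny page (the original was a LiveCode server script; this is not it). It takes a ?page= parameter, fetches that URL on the server side, and returns only the first 6000 characters. PrefixHandler, target_from_path, LIMIT and the port number are all assumptions for the sketch, and a real deployment would need to restrict which URLs it will fetch, since as written it is an open proxy.

```python
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

LIMIT = 6000  # number of characters to send back, as in the quoted script


def target_from_path(path):
    """Extract the ?page= value from a request path, or None if absent."""
    values = parse_qs(urlparse(path).query).get("page")
    return values[0] if values else None


class PrefixHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        target = target_from_path(self.path)
        if not target or not target.startswith(("http://", "https://")):
            self.send_error(400, "missing or invalid ?page= parameter")
            return
        try:
            # The server downloads the whole page over its own connection.
            with urllib.request.urlopen(target, timeout=10) as resp:
                text = resp.read().decode("utf-8", errors="replace")
        except OSError:
            self.send_error(502, "could not fetch the requested page")
            return
        # Only the first LIMIT characters travel back to the client.
        body = text[:LIMIT].encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8000), PrefixHandler).serve_forever()
```

The client side stays a one-line URL fetch with the target address passed (URL-encoded) in the page parameter.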
On Wed, Dec 13, 2017 at 8:39 AM, Mike Bonner <bonnmike at gmail.com> wrote:
> I suppose one could use sockets and partial GET requests (using a range:
> header), but i suspect it would be easier to just use an intermediary
> server to handle things. To test, I set up an extremely simple page with
> the following:
> -- a GET request to my page takes the form ?page=
> put $_GET["page"] into tPage
> -- request the page to be scraped and return the first 6000 chars
> put char 1 to 6000 of URL tPage
> Using this is simple: get URL "http://path.to.my.page.com/
> If the page to be scraped uses a GET-style request, it might be
> better to use POST instead.
> In this way you can use a server on a hot connect to do the heavy lifting
> and then just send the results back down. In fact, you could probably have
> the server itself do the scraping and just return any final results (or pop
> the results into a database or whatever). Also, if you have enough
> control of the server and need to scrape the same page over and over for
> changes, you could most likely set up a cron job to do the work and a front
> end to pull the results. (I don't know what your final objective is, so it's
> hard to say what's best.)
> On Wed, Dec 13, 2017 at 6:39 AM, Roger Eller via use-livecode <
> use-livecode at lists.runrev.com> wrote:
>> I have a webpage that I grab with LiveCode, then parse out what I need.
>> The data I keep is within the first 1/4th of the page.
>> Rather than loading the entire page into a variable or a browser object,
>> how can I load just the portion that I need and then stop the transmission
>> instead of wasting the time and bandwidth to load the entire page?