Website scraping - How can I load a 'partial' page?

Mike Bonner bonnmike at gmail.com
Wed Dec 13 11:23:59 EST 2017


Hmm, or use range as mentioned in my other mail.

If the server supports range requests you can set your headers to include--
Range: bytes=0-1999    to get the first 2000 bytes (the range is inclusive).

Or use curl with -r 0-2000, but I have yet to find a page that will return
only a range.
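To see what a range request looks like end to end, here's a sketch in Python (purely illustrative; the server below is a local stand-in so the example is self-contained, since not every real server honors ranges):

```python
# Illustrative sketch only: a tiny local server that honors Range
# requests, and a urllib client that fetches just the first 2000 bytes.
# Real servers advertise support via the Accept-Ranges header; this
# handler is a stand-in so the example runs without a network.
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

BODY = b"x" * 10000  # stand-in for a large page

class RangeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        rng = self.headers.get("Range", "")
        if rng.startswith("bytes="):
            start, end = (int(n) for n in rng[len("bytes="):].split("-"))
            chunk = BODY[start:end + 1]  # HTTP ranges are inclusive
            self.send_response(206)  # Partial Content
            self.send_header(
                "Content-Range",
                f"bytes {start}-{start + len(chunk) - 1}/{len(BODY)}")
            self.send_header("Content-Length", str(len(chunk)))
            self.end_headers()
            self.wfile.write(chunk)
        else:
            self.send_response(200)
            self.send_header("Content-Length", str(len(BODY)))
            self.end_headers()
            self.wfile.write(BODY)

    def log_message(self, *args):  # silence request logging
        pass

server = ThreadingHTTPServer(("127.0.0.1", 0), RangeHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

req = urllib.request.Request(
    f"http://127.0.0.1:{server.server_address[1]}/",
    headers={"Range": "bytes=0-1999"})
with urllib.request.urlopen(req) as resp:
    status, data = resp.status, resp.read()
server.shutdown()
print(status, len(data))  # 206 2000
```

A server that honors the range answers 206 (Partial Content) with only the requested bytes; one that ignores it answers 200 with the whole body, which is why the check below matters.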

Apparently you can find out whether a page accepts ranges using curl, with
something like this:

curl -I http://i.imgur.com/z4d4kWk.jpg

HTTP/1.1 200 OK
...
Accept-Ranges: bytes
Content-Length: 146515

if it has "Accept-Ranges: bytes" as part of the response, it should work.
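The same check can be done programmatically; a minimal sketch in Python (HTTP header names are case-insensitive, so normalize before comparing):

```python
# Sketch: decide from response headers whether a server advertises
# byte-range support. Header names are case-insensitive in HTTP,
# so normalize them before looking up Accept-Ranges.
def supports_ranges(headers):
    lowered = {k.lower(): v for k, v in headers.items()}
    return lowered.get("accept-ranges", "").strip().lower() == "bytes"

print(supports_ranges({"Accept-Ranges": "bytes"}))    # True
print(supports_ranges({"accept-ranges": "BYTES"}))    # True
print(supports_ranges({"Accept-Ranges": "none"}))     # False -- explicit opt-out
print(supports_ranges({"Content-Length": "146515"}))  # False -- header absent
```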
I'm still thinking the intermediary method is best.
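One wrinkle with the intermediary approach (the scrape.lc script quoted below): if the target URL carries its own query string, it should be URL-encoded so the whole thing survives as a single ?page= parameter. A Python illustration, with placeholder host names:

```python
# Build a request to a hypothetical intermediary scrape script,
# URL-encoding the target so its own ?id=42 query string survives
# as part of the single "page" parameter. Host names are placeholders.
from urllib.parse import urlencode

target = "http://server.to.scrape.com/pagetoscrape.html?id=42"
request_url = ("http://path.to.my.page.com/scrape.lc?"
               + urlencode({"page": target}))
print(request_url)  # the target's :, /, ? and = arrive percent-encoded
```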



On Wed, Dec 13, 2017 at 8:39 AM, Mike Bonner <bonnmike at gmail.com> wrote:

> I suppose one could use sockets and partial GET requests (using a range:
> header), but i suspect it would be easier to just use an intermediary
> server to handle things.  To test, I set up an extremely simple page with
> the following:
>
> <?lc
> -- called with a GET request of the form ?page=http://url.goes.here
> put $_GET["page"] into tPage
> -- fetch the page to be scraped and return the first 6000 chars
> put char 1 to 6000 of url tPage
> ?>
> Using it is as simple as--
> get URL "http://path.to.my.page.com/scrape.lc?page=http://server.to.scrape.com/pagetoscrape.html"
>
> If the page to be scraped uses a GET-style request itself, it might be
> better to use POST instead.
>
> In this way you can use a server on a fast connection to do the heavy lifting
> and then just send the results back down.  In fact, you could probably have
> the server itself do the scraping and just return any final results (or pop
> the results into a database or whatever).  Also, if you have enough
> control of the server and need to scrape the same page over and over for
> changes, you could most likely set up a cron job to do the work and a front
> end to pull the results.  (I don't know what your final objective is, so it's
> hard to say what's best.)
>
>
>
> On Wed, Dec 13, 2017 at 6:39 AM, Roger Eller via use-livecode <
> use-livecode at lists.runrev.com> wrote:
>
>> I have a webpage that I grab with LiveCode, then parse out what I need.
>> The data I keep is within the first 1/4th of the page.
>>
>> Rather than loading the entire page into a variable or a browser object,
>> how can I load just the portion that I need and then stop the transmission
>> instead of wasting the time and bandwidth to load the entire page?
>>
>> ~Roger
>> _______________________________________________
>> use-livecode mailing list
>> use-livecode at lists.runrev.com
>> Please visit this url to subscribe, unsubscribe and manage your
>> subscription preferences:
>> http://lists.runrev.com/mailman/listinfo/use-livecode
>>
>
>


