How to recover text from a web page

Sumner, Walt WSUMNER at dom.wustl.edu
Wed Sep 22 15:14:44 EDT 2010


Thanks for the lead on screen scraping, but the problem is that there is
nothing to scrape. Both "put URL(...)" and "revBrowserGet(tBrowserId,"htmltext")"
return the HTML, but not all of the text that is displayed on the page.
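
To make that concrete, here is a minimal Python sketch (not LiveCode) of the
kind of check I mean. The sample page and the loadSignatures() name are made
up for illustration; the real petition page surely differs. It only shows
that text added by a script after loading is absent from the fetched HTML.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip = 0      # depth inside <script>/<style>
        self.chunks = []    # text found in the static markup

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def static_text(html):
    """Return the text that is literally present in the HTML source."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical stand-in for a fetched page: the signature block is an
# empty container that a script would fill in after the page loads.
sample = """<html><body><h1>Petition</h1>
<div id="sigs"></div>
<script>loadSignatures("sigs");</script>
</body></html>"""

print(static_text(sample))
```

Only the heading comes out; whatever the script would have inserted into the
empty container never appears in the source.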

In fact, if I use Word's document comparison tool to compare the HTML from
pages 2, 9, and 256 of the petition, there is NO DIFFERENCE between the files.
The petition signatures and comments are embedded in a petition widget, I
think, which I suppose is some JavaScript applet. Whatever it is, the HTML
definitely does not contain the petition text that I want to evaluate.
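
One possible explanation, which I have not confirmed: if the sibling pages
differ only in the part of the URL after the "#", that fragment is handled
by the browser and is never sent to the server, so every page would fetch
the same HTML. A small Python sketch, where the sibling URLs are guesses
modeled on the one URL I posted:

```python
from urllib.parse import urldefrag


def request_urls(urls):
    """Return the distinct URLs a client would request from the server.

    The fragment (everything after '#') is dropped because it is
    resolved on the client side only.
    """
    return {urldefrag(u).url for u in urls}


base = ("http://www.thepetitionsite.com/1/"
        "keep-life-saving-electronic-cigarettes-available/")

# Assumption: sibling pages vary only in the trailing number of the
# fragment. Only the first of these URLs comes from the actual petition.
pages = [base + "#sigs/691732733/user/" + str(n) for n in (1, 2, 9)]

print(request_urls(pages))
```

All three collapse to the same request, which would match what the file
comparison shows.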

Nevertheless it is trivial to manually select and copy all of the text on
the page. Once it is copied it is easy to automatically paste it, scrape it
(that code works fine), and store data using LiveCode, but I do not see a
way to select and copy text from this widget using LiveCode.

> On Tue, 21 Sep 2010 22:23:17, stephen barncard wrote:
> Why bother with revBrowser at all? Just do this in the message box:
> 
> put URL("http://website.com/page.html")
> 
> and this will put the website html into the message box output. Obviously
> you could do this with fields.
> 
> Check out Jerry's videos on Screen Scraping:
> 
> http://revmentor.com/business-logic-screen-scraping-1
> http://revmentor.com/business-logic-screen-scraping-0
> 
> 
> On 21 September 2010 22:16, Sumner, Walt <WSUMNER at dom.wustl.edu> wrote:
> 
>> I am trying to recover text from this web page and all of its siblings:
>> 
>> 
>> http://www.thepetitionsite.com/1/keep-life-saving-electronic-cigarettes-available/#sigs/691732733/user/1
>> 
>> The interesting part of the page is the comments, which do not appear in
>> the HTML, but which can be copied manually. I can open this page in a
>> browser in LiveCode. With manual mouse motions, I can double click a block
>> of text, choose "Select All" from the "Edit" menu, choose "Copy" from the
>> "Edit" menu, and then paste into a field where the comments all appear and
>> are easy to disassemble.
>> 
>> Unfortunately, the revBrowserSet command and revBrowserGet function do not
>> do anything comparable AFAICT. The "Select All" choice is not implemented in
>> the doMenu command. I think that printing a PDF is also out. So, any
>> thoughts on how to automate this part of a petition review? For instance,
>> maybe there is a simple way to save the text to a file with the
>> revBrowserExecuteScript function (using JavaScript for Safari)?
>> 
>> BTW, the browser is fully capable of crashing LiveCode on at least some OS X
>> machines. Please don't lose any work on my account.
>> 
>> Thanks,
>> 
>> Walt

Walton Sumner