Web Robots in Revolution

Mark Schonewille m.schonewille at economy-x-talk.com
Thu Mar 13 09:54:45 EDT 2008


Gregory,

This should be simple, though time consuming. If you don't know in
advance which domains you want to download news from, there's no need
to customize for individual sites. Just get the links from a search
engine and parse the list.
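The link-parsing step might be sketched as follows. The original discussion concerns Revolution, but the logic translates directly; this Python version uses the standard-library HTML parser, and the `LinkExtractor` name is my own invention.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every absolute anchor tag in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                # Keep only absolute http(s) links; search-result pages
                # also contain relative navigation links we can skip.
                if name == "href" and value and value.startswith("http"):
                    self.links.append(value)

# Feed it the HTML of a search-result page (here a tiny stand-in).
parser = LinkExtractor()
parser.feed('<p><a href="http://example.com/story1">Story</a> '
            '<a href="/settings">skip me</a></p>')
print(parser.links)  # ['http://example.com/story1']
```

From here you would deduplicate the list and queue each link for download.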

Then download the pages and determine where the body text starts and
finishes, or simply remove everything that probably isn't body text
(e.g. text with a relatively large number of exclamation marks, short
paragraphs without punctuation, etc.). Finally, store the text and
flag the pages that seem to consist of body text only, or of garbage
only, for manual review.
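The two heuristics mentioned above could look like this; the thresholds (`min_words`, `max_bang_ratio`) are made-up starting points that you would tune against real pages during testing.

```python
def looks_like_body_text(paragraph, min_words=8, max_bang_ratio=0.02):
    """Heuristic filter: keep paragraphs that read like prose.

    Rejects short fragments without sentence-ending punctuation
    (navigation bars, captions) and paragraphs with a high
    exclamation-mark density (ads, shouting).
    """
    words = paragraph.split()
    if len(words) < min_words and not paragraph.rstrip().endswith((".", "?", "!")):
        return False  # short fragment without sentence punctuation
    bangs = paragraph.count("!")
    if bangs and bangs / max(len(words), 1) > max_bang_ratio:
        return False  # too many exclamation marks per word
    return True

page_paragraphs = [
    "The patent office rejected the application after a lengthy review.",
    "CLICK HERE NOW!!!",
    "Home | News | Contact",
]
body = [p for p in page_paragraphs if looks_like_body_text(p)]
print(body)  # only the first paragraph survives
```

A page whose paragraphs all pass could be flagged "body text only", and one where none pass flagged "garbage only", for the manual-review step.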

Naturally, you will want to ignore pages containing particular words
and domains containing particular words; you probably also want to
ignore nonsense domains (say, longer than 7 characters with 0 or 1
vowels in them). I'm sure you'll find more ways to filter the search
results once you start testing.
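The nonsense-domain rule of thumb is easy to express in code; again a sketch in Python, with the same 7-character / 1-vowel thresholds given above as adjustable defaults.

```python
def is_nonsense_domain(domain, max_len=7, max_vowels=1):
    """Flag domains whose name part is long but nearly vowel-free,
    e.g. 'xkqzpbrtw.com'. Thresholds follow the rule of thumb:
    longer than 7 characters with 0 or 1 vowels."""
    name = domain.split(".")[0].lower()   # part before the first dot
    vowels = sum(name.count(v) for v in "aeiou")
    return len(name) > max_len and vowels <= max_vowels

print(is_nonsense_domain("xkqzpbrtw.com"))  # True: 9 letters, 0 vowels
print(is_nonsense_domain("reuters.com"))    # False: only 7 letters
```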

It is important to make your filters adjustable, preferably with a
nice GUI, so you can tweak them without changing your scripts.
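One way to keep the filters out of the scripts is to load them from an external settings file that the GUI (or a plain text editor) edits; the file name and keys below are hypothetical.

```python
import json

# Hypothetical contents of a settings file such as "filters.json";
# a GUI would rewrite this file, and the scraper would reload it.
settings_json = '''
{
  "blocked_words": ["casino", "lottery"],
  "max_domain_length": 7,
  "max_vowels": 1,
  "max_bang_ratio": 0.02
}
'''
settings = json.loads(settings_json)
print(settings["blocked_words"])  # ['casino', 'lottery']
```

The scraping scripts then read their thresholds from `settings` instead of hard-coding them, so tuning never requires a code change.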

Best regards,

Mark Schonewille

--

Economy-x-Talk Consulting and Software Engineering
http://economy-x-talk.com
http://www.salery.biz

A large collection of scripts for HyperCard, Revolution, SuperCard and  
other programming languages can be found at http://runrev.info




On 13 mrt 2008, at 14:34, Gregory Lypny wrote:

> Hello everyone,
>
> I'm working on a major research project that involves the analysis  
> of hundreds of thousands of news releases.  I've used Revolution to  
> build utility applications that will index news files that I've  
> obtained from Factiva, but now I'd like to expand my news sources.   
> I'm hoping that you can advise me on the feasibility of building  
> something in Revolution that would submit multiple queries (e.g.,  
> news for 2005 having to do with patent rejections), extract the  
> links to the hits, then run through them and grab the individual  
> stories to catalogue them.  I can appreciate that it would have to  
> be customized for each news site.  Any insights on the general  
> approach would be most appreciated.
>
> Regards,
>
>
> Gregory Lypny
>
> Associate Professor of Finance
> John Molson School of Business
> Concordia University
> Montreal, Canada




More information about the Use-livecode mailing list