Web Robots in Revolution
Mark Schonewille
m.schonewille at economy-x-talk.com
Thu Mar 13 09:54:45 EDT 2008
Gregory,
This should be simple, though time-consuming. If you don't know in
advance which domains you want to download news from, there's no need
to customize for individual sites. Just get the links from a search
engine and parse the list.
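To illustrate the link-parsing step, here is a minimal sketch. It is
written in Python only because it is compact; the same logic should
port to Revolution. The names LinkExtractor and extract_links are
made up for this example, and it assumes you already have the HTML of
a results page in a variable.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects absolute http(s) links from the anchor tags of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page they were found on.
        url = urljoin(self.base_url, href)
        # Keep web links only (drops mailto:, javascript:, etc.) and skip duplicates.
        if urlparse(url).scheme in ("http", "https") and url not in self.links:
            self.links.append(url)


def extract_links(html, base_url):
    """Return the unique http(s) links in html, in order of appearance."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

You would still need to drop the search engine's own navigation links,
which is where the filters further down come in.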
Then download the pages and determine where the body text starts and
finishes, or simply remove everything that probably isn't body text
(e.g. text with a relatively large number of exclamation marks, short
paragraphs without punctuation, etc.). Finally, store the text and
mark for manual review any texts that seem to consist of body text
only or garbage only.
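A rough sketch of such a paragraph filter, again in Python for
brevity. The function names and the thresholds (2% exclamation marks,
fewer than 8 words) are assumptions for illustration, not tested
values; you would tune them against real pages.

```python
import re


def classify_paragraph(text, max_exclaim_ratio=0.02, min_words=8):
    """Label one paragraph 'body' or 'junk' using two crude heuristics."""
    text = text.strip()
    if not text:
        return "junk"
    # Heuristic 1: a relatively large share of exclamation marks.
    if text.count("!") / len(text) > max_exclaim_ratio:
        return "junk"
    # Heuristic 2: a short paragraph with no sentence punctuation at all.
    has_punctuation = re.search(r"[.,;:?]", text) is not None
    if len(text.split()) < min_words and not has_punctuation:
        return "junk"
    return "body"


def classify_page(paragraphs, **thresholds):
    """Return (body_paragraphs, needs_review) for one downloaded page."""
    labels = [classify_paragraph(p, **thresholds) for p in paragraphs]
    body = [p for p, label in zip(paragraphs, labels) if label == "body"]
    # A page that is all body text or all junk is flagged for manual review.
    needs_review = len(set(labels)) <= 1
    return body, needs_review
```

Splitting a page into paragraphs is left out here; one paragraph per
block of text between tags would be a reasonable starting point.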
Naturally, you will want to ignore sites that contain particular words
and domains whose names contain particular words. You will probably
also want to ignore nonsense domains (say, longer than 7 characters
with 0 or 1 vowels in them). I'm sure you'll find more ways to filter
the search results when you start testing.
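The domain checks could look roughly like this (Python again, as a
sketch only). The names should_skip and is_nonsense_label are
invented for the example. It assumes the last part of the host name is
the top-level domain and tests every other part, which is a
simplification. Filtering on words in the page text itself is not
shown.

```python
from urllib.parse import urlparse

VOWELS = set("aeiou")


def is_nonsense_label(label, longer_than=7, max_vowels=1):
    """True for a long host-name part with almost no vowels, e.g. 'xkqzrtplm'."""
    vowels = sum(1 for ch in label.lower() if ch in VOWELS)
    return len(label) > longer_than and vowels <= max_vowels


def should_skip(url, blocked_words=(), longer_than=7, max_vowels=1):
    """Decide from the URL's host name alone whether to ignore a link."""
    host = (urlparse(url).hostname or "").lower()
    # Rule 1: the host name contains a word from the block list.
    if any(word.lower() in host for word in blocked_words):
        return True
    # Rule 2: any part of the host name, except the assumed TLD, looks like nonsense.
    labels = host.split(".")[:-1]
    return any(is_nonsense_label(part, longer_than, max_vowels) for part in labels)
```

For example, should_skip("http://xkqzrtplm.com/page") is true, while
a host such as www.reuters.com passes both rules.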
It is important to make your filters adjustable, preferably with a
nice GUI, so you can tweak them without changing your scripts.
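One simple way to get there before building a GUI is to keep the
filter values in a settings file and read them at start-up. The
sketch below (Python, for brevity) uses JSON; the setting names and
defaults are hypothetical, and a GUI could later write the same file.

```python
import json

# Hypothetical defaults; every value here is an assumption to be tuned.
DEFAULT_SETTINGS = {
    "max_exclaim_ratio": 0.02,
    "min_words": 8,
    "blocked_words": [],
}


def load_settings(json_text):
    """Merge user overrides (a JSON object as text) over the defaults."""
    settings = dict(DEFAULT_SETTINGS)
    overrides = json.loads(json_text) if json_text.strip() else {}
    # Reject misspelled keys so a typo does not silently do nothing.
    unknown = set(overrides) - set(DEFAULT_SETTINGS)
    if unknown:
        raise KeyError(f"unknown filter settings: {sorted(unknown)}")
    settings.update(overrides)
    return settings
```

With this, changing a threshold means editing one line of the settings
file rather than the script.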
Best regards,
Mark Schonewille
--
Economy-x-Talk Consulting and Software Engineering
http://economy-x-talk.com
http://www.salery.biz
A large collection of scripts for HyperCard, Revolution, SuperCard and
other programming languages can be found at http://runrev.info
On 13 mrt 2008, at 14:34, Gregory Lypny wrote:
> Hello everyone,
>
> I'm working on a major research project that involves the analysis
> of hundreds of thousands of news releases. I've used Revolution to
> build utility applications that will index news files that I've
> obtained from Factiva, but now I'd like to expand my news sources.
> I'm hoping that you can advise me on the feasibility of building
> something in Revolution that would submit multiple queries (e.g.,
> news for 2005 having to do with patent rejections), extract the
> links to the hits, then run through them and grab the individual
> stories to catalogue them. I can appreciate that it would have to
> be customized for each news site. Any insights on the general
> approach would be most appreciated.
>
> Regards,
>
>
> Gregory Lypny
>
> Associate Professor of Finance
> John Molson School of Business
> Concordia University
> Montreal, Canada