Analyzing News Stories

Alex Tweedly alex at tweedly.net
Wed Jan 26 10:27:41 EST 2005


Gregory Lypny wrote:

> Hello everyone,
>
> I'm starting a research project that will relate the flow of 
> information, in the form of news reports from Reuters and Canada 
> NewsWire, to activity in capital markets, namely, the volume and 
> volatility of trade on the stock market.  The Canadian news sample 
> will consist of about 240,000 stories going back to 1995.  I have two 
> questions about processing the news data with Revolution.
>
> 1.  Creating an index of key words.  I will use arrays to create 
> indexes of key words from the headlines and the stories.  There are 
> many words that must be kept out of the indexes: a, the, about, there, 
> this, etc.  The list of these is probably longer than I can imagine.  
> Does anyone have a list of words that would "typically" be omitted 
> from a key word index and which they'd be willing to share?  If not, 
> does anyone know where I might get such a list?

Index every word in a random sample of 100 stories. Eliminate any word 
that appears in more than 80% of them. Look briefly at the words that 
appear in between 50% and 80% of them and see what you think about them; 
if necessary, adjust the thresholds until the list feels right for your 
purposes.
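
For the counting itself, something along these lines would do it. This is 
just an untested sketch (the handler name is mine, the folder argument is 
wherever you've saved the sample stories, and it makes no attempt to strip 
punctuation, so "market." and "market" count separately):

  -- count, for each word, how many of the sampled stories contain it
  on buildDocCounts pFolder
    set the defaultFolder to pFolder
    put the files into tFiles
    put the number of lines of tFiles into tStoryCount
    repeat for each line tFile in tFiles
      put URL ("file:" & tFile) into tStory
      put empty into tSeen   -- words already noted for this story
      repeat for each word tWord in tStory
        put 1 into tSeen[toLower(tWord)]
      end repeat
      repeat for each key tWord in tSeen
        add 1 to tDocCount[tWord]   -- one vote per story, not per occurrence
      end repeat
    end repeat
    -- words in more than 80% of the stories are stop-list candidates
    repeat for each key tWord in tDocCount
      if tDocCount[tWord] > 0.8 * tStoryCount then
        put tWord & return after tStopList
      end if
    end repeat
    put tStopList into URL "file:stoplist.txt"   -- lands in the story folder
  end buildDocCounts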

Do you want to index each word separately, or try to accumulate common 
roots? E.g. should cause, causes, caused, causing, causation be one 
entry, two entries, or five?
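
If you do want to fold them together, even something as crude as stripping 
a few common suffixes gets you part of the way. This is a toy of my own, 
not a real stemmer (Porter's algorithm handles far more cases):

  function crudeRoot pWord
    put toLower(pWord) into tWord
    put length(tWord) into tLen
    -- try the longer suffixes first so "causation" isn't caught by plain "s"
    repeat for each item tSuffix in "ation,ing,ed,es,e,s"
      put length(tSuffix) into tSufLen
      if tLen > tSufLen + 2 then
        if char (tLen - tSufLen + 1) to tLen of tWord is tSuffix then
          return char 1 to (tLen - tSufLen) of tWord
        end if
      end if
    end repeat
    return tWord
  end crudeRoot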

> 2.  Obtaining the data.  I already have the Canadian news stories but 
> our method of retrieving them was less than ideal, and I want to 
> improve upon this for follow-up studies.  The typical setup for most 
> web sites that provide information to the public, and this is true of 
> many scientific sites, is that they take the visitor's query and then 
> return a list of hits.  These hits are displayed as hyperlinks that 
> lead to more detailed underlying information.  You know the story.  
> The problem for researchers is that we're often interested in 
> analyzing all of the information that comes up in a query.  So we 
> prefer to be able to download all of the data in some raw format, text 
> ideally, go through it, clean it up, discard what we don't need, and 
> proceed with our research.  Some sites do permit visitors to download 
> entire data sets from FTP sites or through more direct database 
> communication, but most do not.  Do you think it's feasible to create 
> a kind of web data extraction utility in Revolution?  I'm thinking 
> that I would visit a news site, for example, enter a query, and then 
> use resulting tags that appear in the URL field to reverse engineer 
> the way their database handles the queries so that I can automate it 
> in Revolution.  Here's an example of one hit from Canada NewsWire 
> drilled down through the hyperlinks to an actual story.
>
> http://www.newswire.ca/en/releases/archive/January2005/26/c7010.html
>
> There's the month, year, date, and serial number (c7010) for the 
> story.  In Revolution, I would then work with
>
>     repeat ...
>         put url (the info above cycled through the serial numbers and 
> dates)
>             into a text file
>     end repeat
>
> The other piece of information that is needed, and which will vary 
> from site to site, is the way the serial numbering works.  Any thoughts 
> on this approach?  Are there ethical considerations in obtaining 
> information in this way?   My guess is no because it simply means 
> going from clicking one thousand hyperlinks at a web site to clicking 
> one Revolution button to obtain the same number of stories.

I'd worry about whether I had deduced the serial numbering scheme fully. 
Did I get every story? Could there be any particular kind of story that 
was indexed differently (e.g. stories printed straight from the AP wire 
might be indexed differently from those written, or extensively 
modified, by the paper's own writers)?
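
For what it's worth, the fetching loop itself is only a few lines. This is 
a sketch only: the date, the serial range, and the output folder are 
placeholders of my own (I don't know how the serials are actually padded 
or allocated), but logging the serials that come back empty at least makes 
gaps in whatever numbering scheme you deduce visible:

  on fetchDay
    put "http://www.newswire.ca/en/releases/archive/January2005/26/" into tBase
    -- the stories folder must already exist under the defaultFolder
    repeat with tSerial = 1 to 9999   -- placeholder range
      put tBase & "c" & tSerial & ".html" into tStoryURL
      put URL tStoryURL into tStory
      if the result is not empty or tStory is empty then
        put tSerial & return after tMissing   -- nothing there; review later
      else
        put tStory into URL ("file:stories/c" & tSerial & ".html")
      end if
      wait 2 seconds with messages   -- don't hammer the server
    end repeat
    put tMissing into URL "file:missing-serials.txt"
  end fetchDay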

There are some ethical issues around collecting large amounts of data; 
you should, at a minimum, read up on the robots.txt convention, and in 
general conform to each site's requests as described in its robots.txt 
file.
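
You can at least fetch and eyeball that file from the same script before 
crawling. A rough check (it only picks out the Disallow lines, and doesn't 
match them against the User-agent sections properly) might look like:

  on checkRobots
    put URL "http://www.newswire.ca/robots.txt" into tRobots
    repeat for each line tLine in tRobots
      if word 1 of tLine is "Disallow:" then
        put word 2 of tLine & return after tDisallowed
      end if
    end repeat
    put tDisallowed   -- path prefixes the site asks crawlers to avoid
  end checkRobots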

-- Alex.




