Analyzing News Stories
gregory.lypny at videotron.ca
Wed Jan 26 09:48:54 EST 2005
I'm starting a research project that will relate the flow of
information, in the form of news reports from Reuters and Canada
NewsWire, to activity in capital markets, namely, the volume and
volatility of trade on the stock market. The Canadian news sample will
consist of about 240,000 stories going back to 1995. I have two
questions about processing the news data with Revolution.
1. Creating an index of key words. I will use arrays to create
indexes of key words from the headlines and the stories. There are
many words that must be kept out of the indexes: a, the, about, there,
this, etc. The list of these is probably longer than I can imagine.
Does anyone have a list of words that would "typically" be omitted from
a key word index and which they'd be willing to share? If not, does
anyone know where I might get such a list?
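To make the indexing idea concrete, here is a minimal sketch, written in Python rather than Revolution, of an index that maps each key word to the stories containing it while skipping stop words. The stop-word list, the function name build_index, and the sample headlines are all invented for illustration; a real stop-word list would be much longer.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately short stop-word list (an assumption, not a
# recommended list); a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "about", "there", "this", "of", "to", "in", "and", "is"}


def build_index(stories, stop_words=STOP_WORDS):
    """Map each key word to the sorted list of story ids whose text contains it."""
    index = defaultdict(set)
    for story_id, text in stories.items():
        # Lower-case the text and split it into runs of letters, digits, apostrophes.
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            if word not in stop_words:
                index[word].add(story_id)
    return {word: sorted(ids) for word, ids in index.items()}


# Made-up headlines, for illustration only.
headlines = {
    "s1": "The bank raises rates",
    "s2": "Rates fall as the bank cuts",
}
index = build_index(headlines)
```

The same structure carries over to a Revolution array keyed by word, with the story ids kept as a list in each element.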
2. Obtaining the data. I already have the Canadian news stories but
our method of retrieving them was less than ideal, and I want to
improve upon this for follow-up studies. The typical setup for most
web sites that provide information to the public, and this is true of
many scientific sites, is that they take the visitor's query and then
return a list of hits. These hits are displayed as hyperlinks that
lead to more detailed underlying information. You know the story. The
problem for researchers is that we're often interested in analyzing all
of the information that comes up in a query. So we prefer to be able
to download all of the data in some raw format, text ideally, go
through it, clean it up, discard what we don't need, and proceed with
our research. Some sites do permit visitors to download entire data
sets from FTP sites or through more direct database communication, but
most do not. Do you think it's feasible to create a kind of web data
extraction utility in Revolution? I'm thinking that I would visit a
news site, for example, enter a query, and then use resulting tags that
appear in the URL field to reverse engineer the way their database
handles the queries so that I can automate it in Revolution. Here's an
example of one hit from Canada NewsWire drilled down through the
hyperlinks to an actual story.
There's the month, year, date, and serial number (c7010) for the story.
In Revolution, I would then work with
put url (the info above cycled through the serial numbers and dates)
into a text file
The other piece of information that is needed, and which will vary from
site to site, is the way the serial numbering works. Any thoughts on this
approach? Are there ethical considerations in obtaining information in
this way? My guess is no because it simply means going from clicking
one thousand hyperlinks at a web site to clicking one Revolution button
to obtain the same number of stories.
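The cycling through dates and serial numbers could be sketched roughly as follows, again in Python rather than Revolution. The URL pattern here is purely hypothetical (example.com, with an invented year/month/day/serial layout); the real pattern, and the range of valid serial numbers, would have to be worked out from the site's own URLs. The sketch only builds the candidate URLs and does not fetch anything.

```python
from datetime import date, timedelta

# Hypothetical URL pattern (an assumption): the real layout has to be read
# off the URLs that the news site itself produces.
URL_TEMPLATE = "http://example.com/archive/{year}/{month:02d}/{day:02d}/c{serial}.html"


def candidate_urls(start, end, serials, template=URL_TEMPLATE):
    """Yield one URL per (day, serial) pair, from start to end inclusive."""
    day = start
    while day <= end:
        for serial in serials:
            yield template.format(
                year=day.year, month=day.month, day=day.day, serial=serial
            )
        day += timedelta(days=1)


# Two days and two serial numbers give four candidate URLs.
urls = list(candidate_urls(date(2005, 1, 25), date(2005, 1, 26), range(7010, 7012)))
```

Each candidate URL could then be retrieved and saved to a text file, the equivalent of the "put url ... into" step above. Many date and serial combinations will presumably not correspond to a real story, so the retrieval step would need to tolerate missing pages.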
Associate Professor of Finance
John Molson School of Business