Analyzing News Stories

Gregory Lypny gregory.lypny at videotron.ca
Wed Jan 26 09:48:54 EST 2005


Hello everyone,

I'm starting a research project that will relate the flow of 
information, in the form of news reports from Reuters and Canada 
NewsWire, to activity in capital markets, namely, the volume and 
volatility of trade on the stock market.  The Canadian news sample will 
consist of about 240,000 stories going back to 1995.  I have two 
questions about processing the news data with Revolution.

1.  Creating an index of key words.  I will use arrays to create 
indexes of key words from the headlines and the stories.  There are 
many words that must be kept out of the indexes: a, the, about, there, 
this, etc.  The list of these is probably longer than I can imagine.  
Does anyone have a list of words that would "typically" be omitted from 
a key word index and which they'd be willing to share?  If not, does 
anyone know where I might get such a list?
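
To make the first question concrete, here is a minimal sketch of the kind of indexing described above.  The handler name buildKeywordIndex and the parameters pText and pStopWords are illustrative only, and the stop-word list is assumed to be held one word per line.  It lowercases each word, strips everything except the letters a-z and digits, and counts what remains in an array keyed by word.

	-- Hypothetical handler: counts occurrences of each key word in pText,
	-- skipping any word found in pStopWords (one stop word per line).
	function buildKeywordIndex pText, pStopWords
		local tIndex, tWord, tKey
		-- Revolution treats a quoted phrase as one word, so drop the quotes first
		replace quote with space in pText
		repeat for each word tWord in pText
			-- lowercase, then remove anything that is not a-z or 0-9
			put replaceText(toLower(tWord), "[^a-z0-9]", empty) into tKey
			if tKey is empty then next repeat
			if tKey is among the lines of pStopWords then next repeat
			add 1 to tIndex[tKey]
		end repeat
		return tIndex
	end buildKeywordIndex

It might be called like this, with a very short stop-word list standing in for the real one:

	put buildKeywordIndex(tHeadline, "a" & return & "the" & return & "about") into tIndex
	put the keys of tIndex into tKeyWords

One caveat with this sketch: the pattern also strips apostrophes and accented letters, so "don't" becomes "dont" and French words would be damaged.  A real index would need a gentler clean-up rule.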

2.  Obtaining the data.  I already have the Canadian news stories but 
our method of retrieving them was less than ideal, and I want to 
improve upon this for follow-up studies.  The typical setup for most 
web sites that provide information to the public, and this is true of 
many scientific sites, is that they take the visitor's query and then 
return a list of hits.  These hits are displayed as hyperlinks that 
lead to more detailed underlying information.  You know the story.  The 
problem for researchers is that we're often interested in analyzing all 
of the information that comes up in a query.  So we prefer to be able 
to download all of the data in some raw format, text ideally, go 
through it, clean it up, discard what we don't need, and proceed with 
our research.  Some sites do permit visitors to download entire data 
sets from FTP sites or through more direct database communication, but 
most do not.  Do you think it's feasible to create a kind of web data 
extraction utility in Revolution?  I'm thinking that I would visit a 
news site, for example, enter a query, and then use the resulting 
parameters that appear in the URL field to reverse-engineer the way their 
database handles queries so that I can automate them in Revolution.  Here's an 
example of one hit from Canada NewsWire drilled down through the 
hyperlinks to an actual story.

http://www.newswire.ca/en/releases/archive/January2005/26/c7010.html

The URL contains the month, year, day of the month, and serial number 
(c7010) for the story.  In Revolution, I would then work with something like

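	-- tDates: one "MonthYear/day" entry per line (e.g. January2005/26); tFolder: output folder
	-- the serial range 7000 to 7100 is only a placeholder
	repeat for each line tDate in tDates
		repeat with tSerial = 7000 to 7100
			put "http://www.newswire.ca/en/releases/archive/" & tDate & "/c" & tSerial & ".html" into tURL
			put url tURL into url ("file:" & tFolder & "/" & replaceText(tDate, "/", "-") & "-c" & tSerial & ".html")
		end repeat
	end repeat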
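
A slightly fuller sketch of the same idea follows, assuming the URL pattern of the example story holds for other dates and serial numbers, which is not confirmed.  The names storyURL and fetchDay and all of their parameters are hypothetical.  It skips serial numbers that fail to download, relying on libURL placing an error message in the result when a fetch fails, and it pauses between requests so as not to hammer the server.

	-- Hypothetical helper: builds a story URL on the pattern of the example above.
	function storyURL pMonthName, pYear, pDay, pSerial
		return "http://www.newswire.ca/en/releases/archive/" & pMonthName & pYear & "/" & pDay & "/c" & pSerial & ".html"
	end storyURL

	-- Hypothetical handler: saves one day's stories as files in pFolder.
	on fetchDay pMonthName, pYear, pDay, pFirst, pLast, pFolder
		local tSerial, tPage
		repeat with tSerial = pFirst to pLast
			put url storyURL(pMonthName, pYear, pDay, tSerial) into tPage
			-- a non-empty result signals a failed download; skip that serial
			if the result is not empty then next repeat
			put tPage into url ("file:" & pFolder & "/c" & tSerial & ".html")
			wait 1 second with messages -- pause between requests
		end repeat
	end fetchDay

With these definitions, storyURL("January", 2005, 26, 7010) reproduces the example URL above.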

The other piece of information that is needed, and which will vary from 
site to site, is the way the serial numbering works.  Any thoughts on this 
approach?  Are there ethical considerations in obtaining information in 
this way?   My guess is no because it simply means going from clicking 
one thousand hyperlinks at a web site to clicking one Revolution button 
to obtain the same number of stories.

	Regards,

		Greg


	Gregory Lypny
	
	Associate Professor of Finance
	John Molson School of Business
	Concordia University
	Montreal, Canada

