Analyzing News Stories

MisterX b.xavier at internet.lu
Wed Jan 26 10:43:07 EST 2005


Greg,

Part of my XOS project deals with knowledge management and data mining (the
two have overlapping uses). Basically it lets me regroup disparate data and
reorganize it nearly automatically - keywords and links are detected
automatically, and the list of exceptions is already quite long...
Furthermore, there is keyword hyperlinking, plus category trees to
cross-reference, classify or navigate across the data.
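
To make the keyword side concrete, here is a rough sketch in Transcript of
the idea - the handler and variable names are made up for illustration, not
the actual XOS code:

  function extractKeywords pText, pIgnoreList
    -- pIgnoreList holds one ignored word per line (a, the, about, there, this...)
    -- (punctuation is left attached here; a real version would strip it)
    local tSeen
    repeat for each word tWord in pText
      put toLower(tWord) into tWord
      if tWord is among the lines of pIgnoreList then next repeat
      put 1 into tSeen[tWord]
    end repeat
    return the keys of tSeen -- one unique keyword per line
  end extractKeywords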

Unfortunately I haven't found many people interested in getting deeper into
developing this - so I've been doing it alone for some years - but it is
more evolved than anything I've seen so far.

You can find a template stack and (somewhat raw) explanations of XOS at 
http://www.monsieurx.com/modules.php?name=News&file=article&sid=166

This is mini-XOS all in one stack - the real application depends on plugin
modules, which you add to folders, and on distributed data which can live
anywhere. Each tab you see in the application is actually just a template. I
have dozens of templates that I can copy and paste into any new XOS database
stack, which cuts the common programming by about 90% in some cases ;)
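
For instance, dropping a ready-made template into a fresh database stack
boils down to a single copy command (the handler, stack and card names below
are invented for the example):

  on addTemplate pTemplateName, pTargetStack
    -- copy a ready-made template card into the target database stack
    copy card pTemplateName of stack "XOS Templates" to stack pTargetStack
  end addTemplate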

I have a web extraction tool like the one you mention, in another template,
and lastly you can also check discretebrowser, which takes web viewing
another step up with better context analysis (keywords, media and links, to
name a few features already present).

By the way, adding more features, or folding the discrete browser into XOS,
is just copy and paste! Few changes would be required, other than linking
cards to a database, since stacks are a bit limited in how much data they
can hold.
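
As a rough illustration of that last point (the handler, field and folder
names are invented here), a card can keep just a record ID while the bulky
text lives in an external file:

  on saveStory pFolder, pStoryID, pStoryText
    -- write the story text to an external file; the card only keeps the ID
    put pStoryText into URL ("file:" & pFolder & "/" & pStoryID & ".txt")
    put pStoryID into field "StoryID"
  end saveStory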

Check it out. For the list of commonly ignored keywords, go to the Options
tab; you will see the ignore list at the lower left. You can add words to it
manually, or with a couple of clicks using the keywords interface present in
each object/record of the book/database. If a keyword has too many
cross-references, it is either a context-root keyword or a keyword to be
ignored.
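
That check is just a count over the index - something like this hypothetical
sketch, where the index format and threshold are invented for the example:

  function flagOverlinkedKeywords pIndex, pThreshold
    -- pIndex: one cross-reference per line, in the form "keyword<tab>recordID"
    -- any keyword referenced more than pThreshold times is a candidate for
    -- the context-root list or the ignore list
    local tCount, tFlagged
    set the itemDelimiter to tab
    repeat for each line tLine in pIndex
      if tCount[item 1 of tLine] is empty then
        put 1 into tCount[item 1 of tLine]
      else
        add 1 to tCount[item 1 of tLine]
      end if
    end repeat
    repeat for each line tKey in the keys of tCount
      if tCount[tKey] > pThreshold then put tKey & return after tFlagged
    end repeat
    return tFlagged -- keywords to review as context roots or ignore-list entries
  end flagOverlinkedKeywords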

cheers
Xavier
--
http://monsieurx.com


> -----Original Message-----
> From: use-revolution-bounces at lists.runrev.com 
> [mailto:use-revolution-bounces at lists.runrev.com] On Behalf Of 
> Gregory Lypny
> Sent: Wednesday, January 26, 2005 15:49
> To: Revolution
> Subject: Analyzing News Stories
> 
> Hello everyone,
> 
> I'm starting a research project that will relate the flow of 
> information, in the form of news reports from Reuters and 
> Canada NewsWire, to activity in capital markets, namely, the 
> volume and volatility of trade on the stock market.  The 
> Canadian news sample will consist of about 240,000 stories 
> going back to 1995.  I have two questions about processing 
> the news data with Revolution.
> 
> 1.  Creating an index of key words.  I will use arrays to 
> create indexes of key words from the headlines and the 
> stories.  There are many words that must be kept out of the 
> indexes: a, the, about, there, this, etc.  The list of these 
> is probably longer than I can imagine.  
> Does anyone have a list of words that would "typically" be 
> omitted from a key word index and which they'd be willing to 
> share?  If not, does anyone know where I might get such a list?
> 
> 2.  Obtaining the data.  I already have the Canadian news 
> stories but our method of retrieving them was less than 
> ideal, and I want to improve upon this for follow up studies. 
>  The typical set up for most web sites that provide 
> information to the public, and this is true of many 
> scientific sites, is that they take the visitor's query and 
> then return a list of hits.  These hits are displayed as 
> hyperlinks that lead to more detailed underlying information. 
>  You know the story.  The problem for researchers is that 
> we're often interested in analyzing all of the information 
> that comes up in a query.  So we prefer to be able to 
> download all of the data in some raw format, text ideally, go 
> through it, clean it up, discard what we don't need, and 
> proceed with our research.  Some sites do permit visitors to 
> download entire data sets from FTP sites or through more 
> direct database communication, but most do not.  Do you think 
> it's feasible to create a kind of web data extraction utility 
> in Revolution?  I'm thinking that I would visit a news site, 
> for example, enter a query, and then use resulting tags that 
> appear in the URL field to reverse engineer the way their 
> database handles the queries so that I can automate it in 
> Revolution.  Here's an example of one hit from Canada 
> NewsWire drilled down through the hyperlinks to an actual story.
> 
> http://www.newswire.ca/en/releases/archive/January2005/26/c7010.html
> 
> There's the month, year, date, and serial number (c7010) for 
> the story. 
>   In Revolution, I would then work with
> 
> 	repeat for each line tSerial in tSerialList
> 		-- the URL above, with the month, date and serial number cycled through
> 		put URL ("http://www.newswire.ca/en/releases/archive/" & \
> 				tMonthYear & "/" & tDay & "/" & tSerial & ".html") \
> 				into URL ("file:" & tSerial & ".txt")
> 	end repeat
> 
> The other piece of information that is needed, and which will 
> vary from site to site, is the way the serial numbering works.  
> Any thoughts on this approach?  Are there ethical 
> considerations in obtaining information in 
> this way?   My guess is no because it simply means going from 
> clicking 
> one thousand hyperlinks at a web site to clicking one 
> Revolution button to obtain the same number of stories.
> 
> 	Regards,
> 
> 		Greg
> 
> 
> 	Gregory Lypny
> 	
> 	Associate Professor of Finance
> 	John Molson School of Business
> 	Concordia University
> 	Montreal, Canada
> _______________________________________________
> use-revolution mailing list
> use-revolution at lists.runrev.com
> http://lists.runrev.com/mailman/listinfo/use-revolution
> 


