Analyzing News Stories

Gregory Lypny gregory.lypny at videotron.ca
Wed Jan 26 10:40:21 EST 2005


Thanks for your replies, Jonathan and Alex.

Jonathan,

I don't know what an RSS feed is, but if it refers to new stories or 
accessing news in real time, that's not what I'm doing.  I want to tap 
into complete archives.

> Most news outlets are moving to RSS feeds...
>
> You could probably set up Rev to continuously monitor an RSS feed, and
> pull out, save, and categorize those stories that you need.



Alex,

> Index every word on a random sample of 100 stories. Eliminate any word
> that appears in more than 80% of them. Look briefly at those words that
> appear in between 50% and 80% and see what you think about them; if
> necessary, adjust the thresholds until it feels right for your purposes.

	Easy enough and intuitive.  Thanks.
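
For reference, Alex's thresholding idea can be sketched roughly as follows. This is a minimal illustration in Python rather than Rev, and the function names (document_frequency, split_by_threshold) and the simple word pattern are assumptions made for the example, not anything from the thread.

```python
import re
from collections import Counter

def document_frequency(stories):
    """Return {word: fraction of stories that contain the word}."""
    if not stories:
        return {}
    counts = Counter()
    for text in stories:
        # Count each word at most once per story, ignoring case.
        words = set(re.findall(r"[a-z']+", text.lower()))
        counts.update(words)
    total = len(stories)
    return {word: n / total for word, n in counts.items()}

def split_by_threshold(freqs, drop_above=0.8, review_above=0.5):
    """Separate words to eliminate from words to look over by hand."""
    drop = sorted(w for w, f in freqs.items() if f > drop_above)
    review = sorted(w for w, f in freqs.items()
                    if review_above <= f <= drop_above)
    return drop, review
```

With a sample of stories loaded as a list of strings, split_by_threshold(document_frequency(sample)) gives the words above 80% and the words in the 50% to 80% band; the two thresholds are parameters so they can be adjusted until the result feels right.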
>
> Do you want to index each word separately, or try to accumulate common
> roots; e.g. cause, causes, caused, causing, causation ... one entry, 2
> entries, 5 entries?
>

	One entry for starters.
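
A rough sketch of what "one entry" per root could look like, again in Python. The suffix list here is a deliberately crude assumption chosen to handle Alex's example words; it is not a real stemming algorithm and will mis-group plenty of English words.

```python
from collections import defaultdict

# Hypothetical, very small suffix list; longer suffixes are tried first.
SUFFIXES = ("ation", "ing", "ed", "es", "e", "s")

def crude_root(word):
    """Strip the first matching suffix, keeping at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def group_by_root(words):
    """Map each crude root to the sorted list of words that share it."""
    groups = defaultdict(set)
    for w in words:
        groups[crude_root(w)].add(w.lower())
    return {root: sorted(members) for root, members in groups.items()}
```

Under these assumptions, cause, causes, caused, causing and causation all fall under the single root "caus", so they would share one index entry.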

> I'd worry about whether I had deduced the serial numbering scheme fully.
> Did I get every story? Could there be any particular kind of story that
> was indexed differently (e.g. stories printed straight from the AP wire
> might be indexed differently from those written, or extensively
> modified, by the paper's own writers).

	Yes, exactly.  I thought about that.  It'll require some tinkering.
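
One simple check along these lines, sketched in Python: list the serial numbers that never turned up between the lowest and highest ones retrieved. It assumes the serials are plain integers, which may not match the archive's actual scheme, and a gap does not prove a story was missed, since the number may be unused or belong to a story indexed differently.

```python
def missing_serials(found):
    """Return serial numbers absent between the lowest and highest found."""
    if not found:
        return []
    present = set(found)
    return [n for n in range(min(present), max(present) + 1)
            if n not in present]
```

For example, missing_serials([1001, 1002, 1004, 1007]) returns [1003, 1005, 1006], which gives a short list of numbers to probe by hand.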
>
> There are some ethical issues about collecting large amounts of data;
> you should, at a minimum, read up on the content of the robots.txt
> system, and in general conform to the site's requests as described in
> their robots.txt files.

	Thanks.  I didn't know about robots policies.  So, I'll request that
	information.
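
As a sketch of what conforming to robots.txt might look like, Python's standard urllib.robotparser module can check a URL against a site's rules before fetching it. The helper name make_checker, the agent name, and the example.com paths below are made up for illustration; a real run would load the site's own robots.txt.

```python
from urllib.robotparser import RobotFileParser

def make_checker(robots_txt):
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

# Hypothetical rules, standing in for a real site's robots.txt.
SAMPLE_RULES = """\
User-agent: *
Disallow: /archive/private/
"""

checker = make_checker(SAMPLE_RULES)
# Ask before fetching each story URL.
allowed = checker.can_fetch(
    "ResearchBot", "http://www.example.com/archive/2005/story1.html")
```

Here allowed is True, while the same call on a URL under /archive/private/ returns False, so the collector can skip anything the site has asked robots not to retrieve.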

	Greg
