Analyzing News Stories
Gregory Lypny
gregory.lypny at videotron.ca
Wed Jan 26 10:40:21 EST 2005
Thanks for your replies, Jonathan and Alex.
Jonathan,
I don't know what an RSS feed is, but if it refers to new stories or
accessing news in real time, that's not what I'm doing. I want to tap
into complete archives.
> Most news outlets are moving to RSS feeds...
>
> You could probably set up Rev to continuously monitor an RSS feed, and
> pull out, save, and categorize those stories that you need.
Alex,
> Index every word on a random sample of 100 stories. Eliminate any word
> that appears in more than 80% of them. Look briefly at those words that
> appear in between 50% and 80% and see what you think about them; if
> necessary, adjust the thresholds until it feels right for your
> purposes.
Easy enough and intuitive. Thanks.
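
A minimal sketch of the thresholding Alex describes, in Python rather than Rev. The function names, the tokenizing regex, and the default cut-offs are assumptions for illustration; the 80% and 50% figures come from the suggestion above and are meant to be adjusted.

```python
import re
from collections import Counter

def document_frequency(stories):
    """Return {word: fraction of stories that contain the word}."""
    counts = Counter()
    for text in stories:
        # Count each word once per story, however often it repeats.
        counts.update(set(re.findall(r"[a-z']+", text.lower())))
    total = len(stories)
    return {word: n / total for word, n in counts.items()}

def split_by_threshold(freqs, drop_above=0.8, review_above=0.5):
    """Words to eliminate outright, and words to look over by hand."""
    drop = sorted(w for w, f in freqs.items() if f > drop_above)
    review = sorted(w for w, f in freqs.items()
                    if review_above <= f <= drop_above)
    return drop, review
```

Running `split_by_threshold(document_frequency(sample))` on a random sample of about 100 stories would give the two word lists; the thresholds can then be nudged until the lists look sensible.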
>
> Do you want to index each word separately, or try to accumulate common
> roots; e.g. cause, causes, caused, causing, causation ... one entry, 2
> entries, 5 entries?
>
One entry for starters.
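
For the "one entry" option, a deliberately crude sketch in Python of folding word variants onto a shared root. The suffix list and the `min_root` guard are assumptions chosen so that Alex's example collapses to one entry; this is not a real stemmer, and an established algorithm such as Porter's would handle far more cases.

```python
from collections import defaultdict

# Longest suffixes first so "causation" loses "ation", not just "n".
SUFFIXES = ("ation", "ing", "ed", "es", "s", "e")

def crude_root(word, min_root=3):
    """Strip one common suffix, keeping at least min_root letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_root:
            return word[: -len(suffix)]
    return word

def group_by_root(words):
    """Map each crude root to the set of words that share it."""
    groups = defaultdict(set)
    for w in words:
        groups[crude_root(w)].add(w.lower())
    return dict(groups)
```

With this rule, cause, causes, caused, causing and causation all land under the root "caus". It will also merge unrelated words that happen to share a prefix, so the groups are worth checking by eye.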
> I'd worry about whether I had deduced the serial numbering scheme
> fully. Did I get every story? Could there be any particular kind of
> story that was indexed differently (e.g. stories printed straight from
> the AP wire might be indexed differently from those written, or
> extensively modified, by the paper's own writers)?
Yes, exactly. I thought about that. It'll require some tinkering.
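
One possible starting point for that tinkering, sketched in Python: list the serial numbers missing from the range that was retrieved. It assumes the archive uses plain integer IDs that run consecutively, which is exactly the assumption in question, and `missing_ids` is a hypothetical helper name.

```python
def missing_ids(found_ids):
    """Return the integers absent between the lowest and highest ID found."""
    found = set(found_ids)
    if not found:
        return []
    return [n for n in range(min(found), max(found) + 1) if n not in found]
```

A gap only hints at a problem: it may be a withdrawn story, or a story filed under a different scheme. Spot-checking a few of the missing numbers by hand would show which.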
>
> There are some ethical issues about collecting large amounts of data;
> you should, at a minimum, read up on the content of the robots.txt
> system, and in general conform to the site's requests as described in
> their robots.txt files.
Thanks. I didn't know about robots policies. So, I'll request that
information.
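
A short sketch of checking a URL against a site's robots.txt rules, using Python's standard `urllib.robotparser`. The rules, the `example.com` URLs, and the agent name `archive-indexer` are made up for illustration; against a real site one would call `set_url()` and `read()` on the parser to fetch the live file instead of passing lines in directly.

```python
from urllib.robotparser import RobotFileParser

def build_checker(robots_lines):
    """Parse robots.txt content supplied as a list of lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser

# Hypothetical rules: everything is open except /private/.
sample_rules = [
    "User-agent: *",
    "Disallow: /private/",
]

checker = build_checker(sample_rules)
allowed = checker.can_fetch("archive-indexer",
                            "http://example.com/archive/1234.html")
blocked = checker.can_fetch("archive-indexer",
                            "http://example.com/private/notes.html")
```

Here `allowed` is True and `blocked` is False. Pausing between requests is also a common courtesy when pulling a large archive.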
Greg