Text Processing Puzzle
Gregory Lypny
gregory.lypny at videotron.ca
Thu Jul 16 17:32:24 EDT 2009
Hello everyone,
Sorry for the long message. I'm scratching my head on this one, and
I'd be interested to know what you think. I'm doing some research
processing stories released by Canada NewsWire from 1999 through
2003. I've got one text file of stories for each year, five in
total. I created a Revolution stack to read these flat files,
identify where each story begins and ends (see the Sample Story at the
bottom of this message), and grab the headlines and some other
information.
What I expected to find is that the number of stories would grow year by
year with the growing popularity of news on the Internet. And that is
true, except for the last year, 2003, where the number of stories is
the lowest (see table Stats on the Stories). What doesn't make sense
is that 2003 is the biggest file at 144 MB. So, I figure there must
be something in my script that is causing me to skip stories in 2003,
but I can't find it. I identify the start of each story by the five
header lines, like these from the sample:
cnnw000020011206dxc600795
592 Words
06 December 2001
16:57 GMT
Canada NewsWire
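One way to test whether stories are being skipped is to compare strict five-line matches against looser single-line matches. The sketch below is in Python rather than Revolution, and its patterns are assumptions inferred from the one sample story in this message; the names find_story_starts and count_id_lines are hypothetical. If the count of ID lines alone is much higher than the count of full five-line matches in the 2003 file, some headers there probably differ from the expected layout.

```python
import re

# Assumed patterns for the five header lines, inferred from a single sample story.
HEADER_PATTERNS = [
    re.compile(r"^cnnw\w+$"),                    # story ID, e.g. cnnw000020011206dxc600795
    re.compile(r"^[\d,]+ Words$"),               # word count, e.g. 592 Words
    re.compile(r"^\d{1,2} [A-Z][a-z]+ \d{4}$"),  # date, e.g. 06 December 2001
    re.compile(r"^\d{1,2}:\d{2} GMT$"),          # time, e.g. 16:57 GMT
    re.compile(r"^Canada NewsWire$"),            # source name
]


def find_story_starts(lines):
    """Return the indexes at which all five header lines match consecutively."""
    starts = []
    for i in range(len(lines) - len(HEADER_PATTERNS) + 1):
        if all(p.match(lines[i + k].strip()) for k, p in enumerate(HEADER_PATTERNS)):
            starts.append(i)
    return starts


def count_id_lines(lines):
    """Count lines that look like a story ID, whether or not the other four follow."""
    return sum(1 for line in lines if HEADER_PATTERNS[0].match(line.strip()))
```

Running both functions on each year's file and comparing the two counts would show whether 2003 is the only year where they diverge.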
I've browsed through the 2003 file and the format does not appear to
have changed. I also replaced line endings for every block of text I
read in to make sure that isn't messing me up.
replace crlf with return in it -- CR+LF pairs become a single return
replace numToChar(13) with return in it -- any remaining bare CR becomes a return
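The same normalization can be sketched in Python, along with a count of each line-ending style so the five files can be compared. This is an illustration only; the function names are hypothetical. One caution that applies when a file is read in blocks: if a CR+LF pair happens to straddle a block boundary, normalizing each block separately turns it into two line breaks instead of one.

```python
def normalize_line_endings(text):
    """Convert CR+LF and bare CR to LF."""
    # Order matters: handle CR+LF first, otherwise each pair becomes two line breaks.
    return text.replace("\r\n", "\n").replace("\r", "\n")


def count_line_endings(text):
    """Count CR+LF pairs, bare CRs, and bare LFs separately."""
    crlf = text.count("\r\n")
    return {
        "crlf": crlf,
        "cr": text.count("\r") - crlf,  # CRs that are not part of a CR+LF pair
        "lf": text.count("\n") - crlf,  # LFs that are not part of a CR+LF pair
    }
```

If the 2003 file shows a different mix of line endings from the other four years, that would be worth checking before looking elsewhere in the script.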
The average number of words per story has remained roughly the same
for all five years, so how is it that the 2003 file can be roughly
three times bigger than the 1999 file yet have 4,000 fewer stories?
What am I missing here?
Regards,
Gregory
STATS ON THE STORIES
Year   Number of stories   Number of words   File size (MB)
1999   17,653              7,950,395         53.8
2000   25,887              13,714,615        92.4
2001   32,764              17,996,931        121.3
2002   37,403              20,160,555        137
2003   13,668              8,341,830         144.2
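
A quick arithmetic check on the table, sketched in Python, is to compute words per story and file size per million words for each year. The figures are copied from the table; the function names are hypothetical.

```python
# Figures copied from the table: year -> (stories, words, file size in MB).
STATS = {
    1999: (17_653, 7_950_395, 53.8),
    2000: (25_887, 13_714_615, 92.4),
    2001: (32_764, 17_996_931, 121.3),
    2002: (37_403, 20_160_555, 137.0),
    2003: (13_668, 8_341_830, 144.2),
}


def words_per_story(year):
    """Average number of counted words per detected story."""
    stories, words, _ = STATS[year]
    return words / stories


def mb_per_million_words(year):
    """File size in MB for every million counted words."""
    _, words, size_mb = STATS[year]
    return size_mb / (words / 1_000_000)


for year in sorted(STATS):
    print(f"{year}  {words_per_story(year):6.1f} words/story  "
          f"{mb_per_million_words(year):5.2f} MB per million words")
```

On these figures, 1999 through 2002 all come out at roughly 6.7 to 6.8 MB per million words, while 2003 comes out at roughly 17.3. Assuming the 2003 file has the same text density as the others, 144.2 MB would hold on the order of 21 million words, so well over half of its content does not appear in the word and story counts.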
SAMPLE STORY
Factiva (R) Dow Jones & Reuters
---------------------------------------------
Yahoo! Canada en francais launches Shopping Guide
cnnw000020011206dxc600795
592 Words
06 December 2001
16:57 GMT
Canada NewsWire
English
(Copyright Canada NewsWire 2001)
Search in French, Connect in French and now buy in French on Yahoo!
Canada en francais
Yahoo! Canada en francais - always open
TORONTO, Dec. 6 /CNW/ - Yahoo! Canada en francais today announced the
launch of a new shopping guide for French speaking Canadian consumers.
Yahoo! Canada en francais Shopping is an ideal solution for
francophones who want the convenience of shopping from home, plus a
variety of shopping options. Shoppers can get started right away by
going to francais.yahoo.ca Shop now and check out great Canadian
stores like Compaq Canada, Sony Style, and Camelot, a division owned
by Archambault.