Text Processing Puzzle

Gregory Lypny gregory.lypny at videotron.ca
Thu Jul 16 17:32:24 EDT 2009


	Hello everyone,

	Sorry for the long message.  I'm scratching my head on this one, and  
I'd be interested to know what you think.  I'm doing some research  
processing stories released by Canada NewsWire from 1999 through  
2003.  I've got one text file of stories for each year, five in  
total.  I created a Revolution stack to read these flat files,  
identify where each story begins and ends (see the Sample Story at the  
bottom of this message), and grab the headlines and some other  
information.

	What I expected to find is that the number of stories would grow year by
year with the growing popularity of news on the Internet.  And that is  
true, except for the last year, 2003, where the number of stories is  
the lowest (see table Stats on the Stories).  What doesn't make sense  
is that 2003 is the biggest file at 144 MB.  So I figure there must
be something in my script that is causing me to skip stories in 2003,
but I can't find it.  I identify the start of each story by five
header lines like these from the sample:

	cnnw000020011206dxc600795
	592 Words
	06 December 2001
	16:57 GMT
	Canada NewsWire
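
One way to cross-check the story count independently of the Revolution stack is to scan for the five-line header in another language and also list the "near misses": ID lines that are not followed by a complete header. The Python sketch below is only an illustration. The regular expressions are guesses based on the single sample header above, not a documented Factiva format, and the function names are hypothetical.

```python
import re

# Patterns are assumptions inferred from the one sample header in this post.
HEADER_PATTERNS = [
    re.compile(r"cnnw[0-9a-z]+"),            # story ID, e.g. cnnw000020011206dxc600795
    re.compile(r"[\d,]+ [Ww]ords?"),         # word count, e.g. 592 Words
    re.compile(r"\d{1,2} [A-Za-z]+ \d{4}"),  # date, e.g. 06 December 2001
    re.compile(r"\d{1,2}:\d{2} GMT"),        # time, e.g. 16:57 GMT
    re.compile(r"Canada NewsWire"),          # source name
]


def find_story_starts(text):
    """Return the 0-based line index of each ID line that opens a full five-line header."""
    lines = [line.strip() for line in text.splitlines()]
    starts = []
    for i in range(len(lines) - len(HEADER_PATTERNS) + 1):
        window = lines[i:i + len(HEADER_PATTERNS)]
        if all(p.fullmatch(s) for p, s in zip(HEADER_PATTERNS, window)):
            starts.append(i)
    return starts


def near_misses(text):
    """Return indexes of ID lines that are NOT followed by a complete header."""
    lines = [line.strip() for line in text.splitlines()]
    good = set(find_story_starts(text))
    return [i for i, s in enumerate(lines)
            if HEADER_PATTERNS[0].fullmatch(s) and i not in good]
```

If the near-miss list for the 2003 file is long, the lines following those IDs should show which of the five header lines differs from what the script expects.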

I've browsed through the 2003 file and the format does not appear to  
have changed.  I also replace line endings in every block of text I
read in, to make sure they aren't messing me up.

	replace crlf with return in it  -- CRLF (Windows) becomes a single return
	replace numToChar(13) with return in it  -- any remaining bare CR becomes a return
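
To confirm that line endings really are uniform across the five files, the raw bytes can be counted before any replacement. The Python sketch below mirrors the two replace commands above and adds a tally of each ending type. It assumes the files use a single-byte encoding, and both function names are made up for illustration.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF pairs, bare CRs, and bare LFs in raw file bytes."""
    crlf = data.count(b"\r\n")
    return {
        "crlf": crlf,
        "cr": data.count(b"\r") - crlf,  # CRs that are not part of a CRLF pair
        "lf": data.count(b"\n") - crlf,  # LFs that are not part of a CRLF pair
    }


def normalize_line_endings(data: bytes) -> bytes:
    """Convert CRLF and bare CR to LF, in the same order as the script."""
    # Order matters: collapse CRLF first so its CR is not converted separately.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
```

Running count_line_endings on the bytes of each year's file (read with open(path, "rb")) would show at a glance whether 2003 mixes ending types in a way the other years do not.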

	The average number of words per story has remained roughly the same  
for all five years, so how is it that the 2003 file can be nearly
three times the size of the 1999 file yet have about 4,000 fewer stories?
What am I missing here?

	Regards,

		Gregory


STATS ON THE STORIES

Year    Number of stories    Number of words    File size (MB)
1999    17,653               7,950,395          53.8
2000    25,887               13,714,615         92.4
2001    32,764               17,996,931         121.3
2002    37,403               20,160,555         137
2003    13,668               8,341,830          144.2
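
The ratios implied by this table can be worked out directly. The Python sketch below uses only the figures above and assumes 1 MB = 1,000,000 bytes (the comparison between years is the same with 1,048,576). By this arithmetic, 1999 through 2002 each come to between 6.7 and 6.8 bytes per counted word, while 2003 comes to about 17.3. That would be consistent with a large share of the 2003 file never being counted at all, though it does not show why.

```python
# Figures copied from the table: year -> (stories, words, file size in MB)
STATS = {
    1999: (17_653, 7_950_395, 53.8),
    2000: (25_887, 13_714_615, 92.4),
    2001: (32_764, 17_996_931, 121.3),
    2002: (37_403, 20_160_555, 137.0),
    2003: (13_668, 8_341_830, 144.2),
}


def ratios(stories, words, size_mb, bytes_per_mb=1_000_000):
    """Derive per-story and per-word ratios from one row of the table."""
    size_bytes = size_mb * bytes_per_mb  # assumption: decimal megabytes
    return {
        "words_per_story": words / stories,
        "bytes_per_word": size_bytes / words,
        "bytes_per_story": size_bytes / stories,
    }


if __name__ == "__main__":
    for year, row in sorted(STATS.items()):
        r = ratios(*row)
        print(f"{year}: {r['words_per_story']:.1f} words/story, "
              f"{r['bytes_per_word']:.2f} bytes/word, "
              f"{r['bytes_per_story']:.0f} bytes/story")
```

If 2003 had the same bytes-per-word ratio as the earlier years, its 144.2 MB would hold well over twice the number of words counted.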


SAMPLE STORY

Factiva (R) Dow Jones & Reuters
---------------------------------------------
Yahoo! Canada en francais launches Shopping Guide
cnnw000020011206dxc600795
592 Words
06 December 2001
16:57 GMT
Canada NewsWire
English
(Copyright Canada NewsWire 2001)

Search in French, Connect in French and now buy in French on Yahoo!

Canada en francais

Yahoo! Canada en francais - always open

TORONTO, Dec. 6 /CNW/ - Yahoo! Canada en francais today announced the  
launch of a new shopping guide for French speaking Canadian consumers.  
Yahoo! Canada en francais Shopping is an ideal solution for  
francophones who want the convenience of shopping from home, plus a  
variety of shopping options. Shoppers can get started right away by  
going to francais.yahoo.ca Shop now and check out great Canadian  
stores like Compaq Canada, Sony Style, and Camelot, a division owned  
by Archambault.


