Text Processing Puzzle

Pierre Sahores psahores at free.fr
Fri Jul 17 04:48:36 EDT 2009


Hi Gregory,

Is the filesize variable properly reset to 0 at the beginning of the
parsing of each new year's stories?
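
To illustrate what I mean, here is a minimal sketch in Python rather than Revolution, since I have not seen your script. The function name, the variable names and the "NNN Words" pattern are my own assumptions, based only on the sample story. The point is that every counter lives inside the function, so each year's file starts again from zero and nothing carries over from the previous year.

```python
import re

# Assumed shape of the word-count line in a story header, e.g. "592 Words".
WORDS_LINE = re.compile(r"^\d[\d,]* Words$")


def count_stories(lines):
    """Count stories and words for ONE year's file.

    Both counters are local, so every call starts again from zero.
    """
    story_count = 0
    word_total = 0
    for line in lines:
        stripped = line.strip()
        if WORDS_LINE.match(stripped):
            story_count += 1
            # Take the number before "Words", dropping thousands separators.
            word_total += int(stripped.split()[0].replace(",", ""))
    return story_count, word_total
```

Calling count_stories once per yearly file gives independent totals. If your script keeps such counters (or a running file offset) in a global or script-local variable, it may be worth checking that they are set back to 0 before the 2003 file is read.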

Best,
--
Pierre Sahores
mobile : 06 03 95 77 70
www.sahores-conseil.com


On 16 Jul 09, at 23:32, Gregory Lypny wrote:

> 	Hello everyone,
>
> 	Sorry for the long message.  I'm scratching my head on this one,  
> and I'd be interested to know what you think.  I'm doing some  
> research processing stories released by Canada NewsWire from 1999  
> through 2003.  I've got one text file of stories for each year, five  
> in total.  I created a Revolution stack to read these flat files,  
> identify where each story begins and ends (see the Sample Story at  
> the bottom of this message), and grab the headlines and some other  
> information.
>
> 	What I expected to find is that the number of stories would grow year  
> by year with the growing popularity of news on the Internet.  And  
> that is true, except for the last year, 2003, where the number of  
> stories is the lowest (see table Stats on the Stories).  What  
> doesn't make sense is that 2003 is the biggest file at 144 MB.  So,  
> I figure there must be something in my script that is causing me to  
> skip stories in 2003 but I can't find it.  I identify the start of  
> each story by the five lines like those in the sample that have
>
> 	cnnw000020011206dxc600795
> 	592 Words
> 	06 December 2001
> 	16:57 GMT
> 	Canada NewsWire
>
> I've browsed through the 2003 file and the format does not appear to  
> have changed.  I also replaced line endings for every block of text  
> I read in to make sure that isn't messing me up.
>
> 	replace crlf with return in it
> 	replace numToChar(13) with return in it
>
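
One thing that may be worth checking here, since you say you normalise each block as you read it: the order of the two replacements is right (CRLF first, then bare CR), but if a CRLF pair happens to straddle a block boundary, each half is converted separately and you get an extra empty line. The Python sketch below only illustrates that effect; the function name is my own, and I am assuming that "return" in Revolution is a linefeed.

```python
def normalize_line_endings(text):
    """Convert CRLF and bare CR line endings to LF.

    CRLF must be handled first, otherwise each CRLF would become two LFs.
    """
    return text.replace("\r\n", "\n").replace("\r", "\n")


# Normalising the whole text keeps "a" and "b" on adjacent lines.
whole = normalize_line_endings("a\r\nb")

# Normalising two blocks that split the CRLF pair inserts an empty line.
split = normalize_line_endings("a\r") + normalize_line_endings("\nb")
```

Here whole is "a" and "b" on two consecutive lines, while split has an empty line between them. That would only disturb a five-line header if the break fell inside one, so I doubt it explains thousands of missing stories on its own, but it is cheap to rule out.
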
> 	The average number of words per story has remained roughly the same  
> for all five years, so how is it that the 2003 file can be roughly  
> three times bigger than the 1999 file yet have 4,000 fewer stories!   
> What am I missing here?
>
> 	Regards,
>
> 		Gregory
>
>
> STATS ON THE STORIES
>
> Year    Number of stories    Number of words    File size (MB)
> 1999    17,653               7,950,395          53.8
> 2000    25,887               13,714,615         92.4
> 2001    32,764               17,996,931         121.3
> 2002    37,403               20,160,555         137
> 2003    13,668               8,341,830          144.2
>
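
A quick sanity check on these figures is to divide the file size by the number of words counted. The Python sketch below uses only the numbers from your table and assumes 1 MB = 1,000,000 bytes; the function name is mine. For 1999 through 2002 it comes out at roughly 7 bytes per counted word, but for 2003 at roughly 17. So either a good half of the 2003 file is never reaching your counters, or that file contains a lot of material that is not story text.

```python
# Figures copied from the table: (stories, words, file size in MB).
stats = {
    1999: (17_653, 7_950_395, 53.8),
    2000: (25_887, 13_714_615, 92.4),
    2001: (32_764, 17_996_931, 121.3),
    2002: (37_403, 20_160_555, 137.0),
    2003: (13_668, 8_341_830, 144.2),
}


def ratios(stories, words, size_mb):
    """Return (bytes per story, bytes per counted word)."""
    size_bytes = size_mb * 1_000_000  # assumption: MB means 10**6 bytes
    return size_bytes / stories, size_bytes / words


for year, row in stats.items():
    per_story, per_word = ratios(*row)
    print(f"{year}: {per_story:8.0f} bytes/story, {per_word:5.1f} bytes/word")
```
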
>
> SAMPLE STORY
>
> Factiva (R) Dow Jones & Reuters
> ---------------------------------------------
> Yahoo! Canada en francais launches Shopping Guide
> cnnw000020011206dxc600795
> 592 Words
> 06 December 2001
> 16:57 GMT
> Canada NewsWire
> English
> (Copyright Canada NewsWire 2001)
>
> Search in French, Connect in French and now buy in French on Yahoo!
>
> Canada en francais
>
> Yahoo! Canada en francais - always open
>
> TORONTO, Dec. 6 /CNW/ - Yahoo! Canada en francais today announced  
> the launch of a new shopping guide for French speaking Canadian  
> consumers. Yahoo! Canada en francais Shopping is an ideal solution  
> for francophones who want the convenience of shopping from home,  
> plus a variety of shopping options. Shoppers can get started right  
> away by going to francais.yahoo.ca Shop now and check out great  
> Canadian stores like Compaq Canada, Sony Style, and Camelot, a  
> division owned by Archambault.
