Text Processing Puzzle
Pierre Sahores
psahores at free.fr
Fri Jul 17 04:48:36 EDT 2009
Hi Gregory,
Is the filesize variable properly reset to 0 at the beginning of each new
year's story parsing?
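
To show what I mean, here is a minimal sketch in Python rather than Revolution. The helper names `count_stories` and `count_all_years` are hypothetical, and I am assuming (only for illustration) that a line reading exactly `Canada NewsWire` marks one story. The point is simply that every accumulator lives inside the per-file function, so nothing can carry over from one year to the next.

```python
def count_stories(lines):
    """Count stories and bytes for ONE year's file.

    All accumulators are local, so each call starts from zero.
    Assumption: a line equal to 'Canada NewsWire' marks one story.
    """
    story_count = 0
    byte_count = 0
    for line in lines:
        byte_count += len(line.encode("utf-8")) + 1  # +1 for the newline
        if line.strip() == "Canada NewsWire":
            story_count += 1
    return story_count, byte_count


def count_all_years(files_by_year):
    """Map each year to its own (story_count, byte_count) pair."""
    return {year: count_stories(lines) for year, lines in files_by_year.items()}
```

If the real script keeps a running file size or story counter in a script-local or global variable, the equivalent fix would be to set it to 0 before each file is opened.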
Best,
--
Pierre Sahores
mobile : 06 03 95 77 70
www.sahores-conseil.com
On 16 Jul 2009, at 23:32, Gregory Lypny wrote:
> Hello everyone,
>
> Sorry for the long message. I'm scratching my head on this one,
> and I'd be interested to know what you think. I'm doing some
> research processing stories released by Canada NewsWire from 1999
> through 2003. I've got one text file of stories for each year, five
> in total. I created a Revolution stack to read these flat files,
> identify where each story begins and ends (see the Sample Story at
> the bottom of this message), and grab the headlines and some other
> information.
>
> What I expected to find is that the number of stories would grow year
> by year with the growing popularity of news on the Internet. And
> that is true, except for the last year, 2003, where the number of
> stories is the lowest (see table Stats on the Stories). What
> doesn't make sense is that 2003 is the biggest file at 144 MB. So,
> I figure there must be something in my script that is causing me to
> skip stories in 2003, but I can't find it. I identify the start of
> each story by five lines like these from the sample:
>
> cnnw000020011206dxc600795
> 592 Words
> 06 December 2001
> 16:57 GMT
> Canada NewsWire
>
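
A rough Python sketch of this kind of five-line header match follows. The patterns are inferred from the single sample story only, so they are assumptions, and `find_story_starts` and `find_near_misses` are hypothetical names. The second function may be the more useful one here: it lists lines that look like a story identifier but are not followed by the full expected header, which is where skipped stories would show up.

```python
import re

# Patterns for the five header lines, guessed from the one sample story.
HEADER_PATTERNS = [
    re.compile(r"^cnnw\w+$"),                    # story identifier
    re.compile(r"^[\d,]+ Words$"),               # word count
    re.compile(r"^\d{1,2} [A-Z][a-z]+ \d{4}$"),  # date, e.g. 06 December 2001
    re.compile(r"^\d{1,2}:\d{2} GMT$"),          # time of release
    re.compile(r"^Canada NewsWire$"),            # source name
]


def find_story_starts(lines):
    """Return indexes where all five header lines match in sequence."""
    n = len(HEADER_PATTERNS)
    starts = []
    for i in range(len(lines) - n + 1):
        window = [line.strip() for line in lines[i:i + n]]
        if all(p.match(s) for p, s in zip(HEADER_PATTERNS, window)):
            starts.append(i)
    return starts


def find_near_misses(lines):
    """Return indexes of identifier-like lines that did not start a full match."""
    starts = set(find_story_starts(lines))
    id_pattern = HEADER_PATTERNS[0]
    return [i for i, line in enumerate(lines)
            if id_pattern.match(line.strip()) and i not in starts]
```

If the near-miss list for 2003 is long, printing a few lines around each index should show how those headers differ from the expected layout.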
> I've browsed through the 2003 file and the format does not appear to
> have changed. I also replaced line endings for every block of text
> I read in to make sure that isn't messing me up.
>
> replace crlf with return in it
> replace numToChar(13) with return in it
>
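
One thing block-by-block replacement can miss is a CRLF pair that straddles two blocks: the CR at the end of one block and the LF at the start of the next each become a line break, which inserts an extra blank line. Whether that happens in the 2003 file is unknown, but it would matter if the header test expects five consecutive lines. Below is a minimal Python sketch (not Revolution) of one way to guard against it; `normalize_blocks` is a hypothetical helper name.

```python
def normalize_line_endings(text):
    """Convert CRLF and bare CR to LF; CRLF must be handled first."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def normalize_blocks(blocks):
    """Normalize blocks read one after another from the same file.

    A trailing CR is held back and prepended to the next block, so a
    CRLF pair split across a block boundary is still seen as one pair.
    """
    carry = ""
    for block in blocks:
        block = carry + block
        carry = ""
        if block.endswith("\r"):
            block, carry = block[:-1], "\r"
        yield normalize_line_endings(block)
    if carry:
        # The file ended with a bare CR.
        yield "\n"
```

For example, the blocks `"a\r"` and `"\nb\rc"` normalized separately give a blank line between `a` and `b`, while `normalize_blocks` joins them as three consecutive lines.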
> The average number of words per story has remained roughly the same
> for all five years, so how is it that the 2003 file can be roughly
> three times bigger than the 1999 file yet have 4,000 fewer stories?
> What am I missing here?
>
> Regards,
>
> Gregory
>
>
> STATS ON THE STORIES
>
> Year   Number of stories   Number of words   File size (MB)
> 1999   17,653              7,950,395         53.8
> 2000   25,887              13,714,615        92.4
> 2001   32,764              17,996,931        121.3
> 2002   37,403              20,160,555        137
> 2003   13,668              8,341,830         144.2
>
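
A quick consistency check on these figures, sketched in Python: dividing file size by counted words gives the bytes of file per counted word. I am assuming 1 MB = 1,000,000 bytes, which only shifts all the ratios by the same factor. For 1999 through 2002 the ratio sits just under 7, while for 2003 it is above 17, which would be consistent with a large share of the 2003 file never being counted as story text (though a different encoding or extra non-story content could also explain it).

```python
# (stories, words, file size in MB) copied from the table above.
STATS = {
    1999: (17_653, 7_950_395, 53.8),
    2000: (25_887, 13_714_615, 92.4),
    2001: (32_764, 17_996_931, 121.3),
    2002: (37_403, 20_160_555, 137.0),
    2003: (13_668, 8_341_830, 144.2),
}


def bytes_per_word(words, size_mb):
    """File bytes per counted word, assuming 1 MB = 1,000,000 bytes."""
    return size_mb * 1_000_000 / words


def words_per_story(stories, words):
    """Average counted words per counted story."""
    return words / stories


ratios = {year: bytes_per_word(words, mb) for year, (_, words, mb) in STATS.items()}
averages = {year: words_per_story(stories, words)
            for year, (stories, words, _) in STATS.items()}

for year in sorted(STATS):
    print(year, round(ratios[year], 2), round(averages[year], 1))
```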
>
> SAMPLE STORY
>
> Factiva (R) Dow Jones & Reuters
> ---------------------------------------------
> Yahoo! Canada en francais launches Shopping Guide
> cnnw000020011206dxc600795
> 592 Words
> 06 December 2001
> 16:57 GMT
> Canada NewsWire
> English
> (Copyright Canada NewsWire 2001)
>
> Search in French, Connect in French and now buy in French on Yahoo!
>
> Canada en francais
>
> Yahoo! Canada en francais - always open
>
> TORONTO, Dec. 6 /CNW/ - Yahoo! Canada en francais today announced
> the launch of a new shopping guide for French speaking Canadian
> consumers. Yahoo! Canada en francais Shopping is an ideal solution
> for francophones who want the convenience of shopping from home,
> plus a variety of shopping options. Shoppers can get started right
> away by going to francais.yahoo.ca Shop now and check out great
> Canadian stores like Compaq Canada, Sony Style, and Camelot, a
> division owned by Archambault.