Reading a (BIG) text file one line at a time - in reality...

Raymond E. Griffith tiffirgrReverse at ctc.net
Wed Nov 24 13:00:33 EST 2004


> On 11/23/04 10:17 PM, Richard Gaskin wrote:
> 
>> If any of you have time to improve the buffering method below I'd be
>> interested in any significant changes to your test results.
> 
> If we want the buffering method to be as fast as possible, so as to test
> the method itself rather than the script that runs it, then we can speed
> up the script by rewriting method #3 like this:
> 
>    put the millisecs into t
>    --
>    put 0 into tWordCount3
>    open file tFile for text read
>    put empty into tBuffer
>    repeat
>      read from file tFile for 32000
>      put tBuffer before it -- stores only 1 line from previous read
>      if it is empty then exit repeat
>      if the number of lines in it > 1 then
>        put last line of it into tBuffer
>        delete last line of it
>      else
>        put empty into tBuffer
>      end if
>      --
>      repeat for each line l in it
>        add the number of words of l to tWordCount3
>      end repeat
>    end repeat
>    --
>    put the millisecs - t into t3
>    close file tFile
>    --
>    --
> 
> This script assumes that the last line in each 32K block is incomplete,
> which will almost always be the case. If the line isn't incomplete, it
> doesn't hurt anything to treat it like it is.
> 
> Problem is, I'm getting a slightly different word count than your
> original method. I didn't debug that because it's getting late, but it
> is off by just a few chars and I suspect it has to do with the very last
> line in the file. At any rate, the idea is that the difference in speed
> is pretty high; in my test the original took about 850 milliseconds and
> the revised one above took about 125. This would probably change your
> benchmarks a bit.
> 
> I added a "close file" command for completeness. If I get a chance, I'll
> try to figure out why my count is off, if someone else doesn't do it first.

I haven't followed the thread closely, so if I am off base please forgive
me.

From examining the script, the word count is probably off by roughly as
many words as there are passes through the repeat loop. You get a word
count on each pass, but by reading in fixed-size blocks you are chopping
not only lines in two, but words as well.
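
To make that suspicion concrete, here is a minimal sketch, written in Python
rather than Transcript so it can be run on its own. It is not the script
quoted above: the function names, the sample text, and the tiny block size
are all made up for illustration. The first function counts words in each
fixed-size block as it is read, with nothing carried over, so a word cut by
a block boundary is counted twice. The second holds back the unfinished last
line and prepends it to the next block. Whether the quoted script really
hits the first case depends on how its buffer handles the carried line.

```python
import io

def count_words_per_block(stream, block_size):
    """Naive: count words in each fixed-size block independently."""
    total = 0
    while True:
        block = stream.read(block_size)
        if not block:
            break
        total += len(block.split())
    return total

def count_words_with_carry(stream, block_size):
    """Hold back the trailing partial line and prepend it to the next block."""
    total = 0
    carry = ""
    while True:
        block = stream.read(block_size)
        if not block:
            break
        block = carry + block
        # Everything after the last newline may be an unfinished line.
        cut = block.rfind("\n") + 1
        carry = block[cut:]
        total += len(block[:cut].split())
    # Whatever is left at end of file is the final line.
    total += len(carry.split())
    return total

# Hypothetical sample data; 7 characters per read forces many mid-word cuts.
sample = "alpha beta gamma\ndelta epsilon\nzeta eta theta iota\n"
whole = len(sample.split())
naive = count_words_per_block(io.StringIO(sample), 7)
carried = count_words_with_carry(io.StringIO(sample), 7)
print(whole, naive, carried)
```

With the carry in place the total matches a count over the whole text, while
the naive version comes out higher by one for every word a boundary splits.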

Another question. I see you are summing the number of words in each line. Is
this really necessary? Perhaps you could just ask for the number of words in
the completed block once you have finished reading it and putting it
together?
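
As a rough illustration of why that should give the same answer, again in
Python rather than Transcript and with made-up sample data: a line break is
itself whitespace, so no word spans two lines, and the word count of a block
of complete lines equals the sum of the per-line counts. Whether the single
count is actually faster in Revolution would have to be measured.

```python
# Hypothetical block of complete lines, including one empty line.
block = "one two three\nfour five\n\nsix\n"

# Sum of per-line word counts, as in the loop over each line.
per_line = sum(len(line.split()) for line in block.splitlines())

# One word count over the whole block.
whole_block = len(block.split())

print(per_line, whole_block)
```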

Regards,

Raymond E. Griffith



