Reading a (BIG) text file one line at a time - in reality...

xbury.cs at clearstream.com
Wed Nov 24 04:11:05 EST 2004


Richard,

There is also the issue of buffer size, which can significantly reduce
the number of file reads (depending on block size, file size, number of
words or lines to be parsed, etc.). There are different optimizations to
consider, usually on a case-by-case basis, so it's hard to generalize.
As a general rule, I try to match the buffer size to an even divisor of
the file size, so that you don't even need the if statement checking for
the end of the file. You gain some performance by avoiding that check.
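To illustrate that buffer-sizing idea outside of Revolution, here is a rough Python sketch (the helper name and the 32000-byte target are my own, not from this thread): pick a chunk size that divides the file size exactly, so the number of reads is known up front and the loop needs no end-of-file test.

```python
import os

def read_whole_file_evenly(path, target=32000):
    """Hypothetical helper: choose a chunk size near `target` that divides
    the file size exactly, then issue a known number of full reads --
    no end-of-file check inside the loop."""
    size = os.path.getsize(path)
    if size == 0:
        return b""
    # Largest divisor of the file size that is <= target (1 always works).
    chunk = next(n for n in range(min(target, size), 0, -1) if size % n == 0)
    parts = []
    with open(path, "rb") as f:
        for _ in range(size // chunk):  # exact read count known up front
            parts.append(f.read(chunk))
    return b"".join(parts)
```

Whether the divisor search pays for itself depends on the file, which is exactly the "case by case" caveat above.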

To squeeze your RAM, you could create a small app (RunRev-made, naturally)
that eats RAM just by filling a variable with random data... This should
be faster than opening all your other apps!
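In Python terms, that RAM-eating trick is just holding a large value in a single variable (the function name and sizes here are arbitrary, for illustration only):

```python
import os

def eat_ram(megabytes):
    # One pseudo-random megabyte repeated `megabytes` times; the memory
    # stays allocated for as long as the returned reference is kept alive.
    return bytearray(os.urandom(1024 * 1024) * megabytes)
```

Dropping the reference (or quitting the app) releases the memory again.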

Can you expand on 1 and 2? I'm not sure what you mean.

Do you mean

  read from file for a word -- or for a line

or something like that?

cheers
Xavier

On 24.11.2004 09:27:58 use-revolution-bounces wrote:
>J. Landman Gay wrote:
>> On 11/23/04 10:17 PM, Richard Gaskin wrote:
>>
>>> If any of you have time to improve the buffering method below I'd be
>>> interested in any significant changes to your test results.
>>
>>
>> If we want the buffering method to be as fast as possible, so as to test
>> the method itself rather than the script that runs it, then we can speed
>> up the script by rewriting method #3 like this:
>>
>>   put the millisecs into t
>>   --
>>   put 0 into tWordCount3
>>   open file tFile for text read
>>   put empty into tBuffer
>>   repeat
>>     read from file tFile for 32000
>>     put tBuffer before it -- stores only 1 line from previous read
>>     if it is empty then exit repeat
>>     if the number of lines in it > 1 then
>>       put last line of it into tBuffer
>>       delete last line of it
>>     else
>>       put empty into tBuffer
>>     end if
>>     --
>>     repeat for each line l in it
>>       add the number of words of l to tWordCount3
>>     end repeat
>>   end repeat
>>   --
>>   put the millisecs - t into t3
>>   close file tFile
>>   --
>>   --
>>
>> This script assumes that the last line in each 32K block is incomplete,
>> which will almost always be the case. If the line isn't incomplete, it
>> doesn't hurt anything to treat it like it is.
>>
>> Problem is, I'm getting a slightly different word count than your
>> original method. I didn't debug that because it's getting late, but it
>> is off by just a few chars and I suspect it has to do with the very last
>> line in the file. At any rate, the idea is that the difference in speed
>> is pretty high; in my test the original took about 850 milliseconds and
>> the revised one above took about 125. This would probably change your
>> benchmarks a bit.
>>
>> I added a "close file" command for completeness. If I get a chance, I'll
>> try to figure out why my count is off, if someone else doesn't do it first.
>
>Good work, Jacque.  I knew there would be a way to change the "repeat
>with" to a "repeat for each", and moving the last line to the buffer and
>walking through "it" instead looks like the way to go.
>
>However, my results differ from yours -- I'm getting an accurate word
>count, but slower speed than before:
>
>200 MB free RAM
>---------------
>Read all:      5.881 secs
>Read buffered: 8.575 secs
>
>Either there was something wrong with the first time I ran the tests, or
>there's something wrong with how I've copied your version in.  And of
>course I have a spare 200MBs of RAM -- got too much to do to go through
>launching all my other apps just to put the squeeze on a test. :)
>
>We still don't know the business specifics of the original poster to
>know if this is at all useful to him, but assuming it will be to others
>down the road the next logical questions are:
>
>1. How can we generalize this so one handler can be used to feed lines
>to another handler for processing?
>
>2. Can we further generalize it to use other chunk types (words, items,
>tokens)?
>
>3. Once we solve #1 and 2, should we request an addition to the engine
>for this?  If you think this is fast now wait till you see what the
>engine can do with it.  It'll be like life before and after the split
>and combine commands.
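For readers following the thread in another language, here is a minimal Python sketch of the buffered method quoted above, which also suggests one answer to question #1: a generator is a natural way to "feed lines to another handler". The names and the newline-only line-ending assumption are mine, not from the thread.

```python
def lines_from_file(path, chunk_size=32000):
    """Read fixed-size chunks, carry the (possibly incomplete) last line
    over to the next read, and yield only complete lines."""
    buffer = ""
    with open(path, "r") as f:
        while True:
            block = f.read(chunk_size)
            if not block:
                if buffer:           # flush the final line at end of file
                    yield buffer
                return
            lines = (buffer + block).split("\n")
            buffer = lines.pop()     # last piece may be an incomplete line
            for line in lines:
                yield line

def count_words(path):
    # The consuming "handler" never sees the chunking at all.
    return sum(len(line.split()) for line in lines_from_file(path))
```

The end-of-file flush of the carry-over buffer is the easiest step to get wrong when porting this pattern, so it is worth testing against a file whose last line has no trailing return.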



