Reading a (BIG) text file one line at a time - in reality...

Richard Gaskin ambassador at fourthworld.com
Tue Nov 23 23:17:56 EST 2004


MisterX wrote:
> Technology may have advanced in OS since the Apple ][ I started with but
> still, VM is still a disk intensive task and more penalizing than you think
> if not done right! Also remember that disk access is slower than ram and
> let's not talk dma/cpuL1/cpuL2 counts of cache hits and misses... But...
> 
> Still, a buffered read avoids any paging of memory (which goes back and
> forth to disk), and this saves about 90% of the time spent processing
> lines of a file.

As with most things in computing, the most accurate answer may be 
"it depends". ;)  There are few absolutes in this field, with most 
solutions lying somewhere along the trade-off between RAM and performance.

True, a buffered read is _generally_ faster, and certainly the most 
consistent across varied systems.  But if this is a one-time project, 
is the time saved in execution worth the time spent scripting the 
buffered read routine?  Is it a consumer app where you'd have no idea 
about the system configuration, or an internal tool you can run on the 
system of your choice?  Will most data files be this large, or is a 
file this size rare in day-to-day use of the app?

These and other considerations may affect outcomes, as we'll see in the 
benchmark results below.

> Again, I have a fully working free source code example...
> 
> And a blocking bug... 

Since reading files is unrelated to the Geometry Manager issue (which 
can be worked around anyway in far less time than waiting for a fix, as 
discussed earlier), there was nothing to stop me from writing a quick 
benchmark script to try this out.

Before I could run the test I needed some data to test with, so I made 
a file with 100,000 lines, each with a random number of words of random 
length, totaling 110.3 MB when I generated it (the script used to 
generate it is copied below).

Then I wrote a quickie benchmark script which runs the three 
file-reading methods in question here, each sketched briefly after the 
list below:

Method 1 - Read all: Uses "put url..." to read the whole file in
                      one move, then uses "repeat for each" to
                      process lines.

Method 2 - Read by line:  Uses "read from file...until cr", which is
                      the most convenient but of course the slowest
                      since it needs to compare each incoming character
                      as it reads.

Method 3 - Buffered: Uses an off-the-cuff buffering algorithm to grab
                      data from disk in 32k chunks, then finds lines
                      within the buffer.
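
In essence (the full scripts are appended at the end of this post), the 
first two methods boil down to:

    -- Method 1: one read, then stream through the lines in RAM
    put url ("file:" & tFile) into tData
    repeat for each line tLine in tData
      -- process tLine here
    end repeat

    -- Method 2: let the engine scan for each line ending
    open file tFile for text read
    repeat
      read from file tFile until cr
      if it is empty then exit repeat
      -- process it here
    end repeat
    close file tFile

with the buffered method adding the bookkeeping needed to reassemble 
lines that straddle the 32k chunk boundaries.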

Since the original question was about processing lines, the goal of all 
three methods was to do something with the contents of each line, so I 
arbitrarily chose to count the words in each line and add them to a 
total.  While a meaningless task in itself, it allowed me to verify the 
accuracy of the methods by comparing the resulting word counts.

The benchmarking script is also copied below, and was run on a 
PowerBook G4 with a single 1 GHz processor and 768 MB total built-in memory.

It should be noted that the buffered-read method I wrote is likely not 
the fastest possible; it was thrown together quickly to get a 
reasonable test result.  I'm sure one of the folks here can improve it, 
but given the overhead inherent in managing a buffer and making 
multiple disk accesses, the results were roughly in line with what I 
would expect.

I ran the test three times, each time with more apps open to further 
decrease the amount of available physical memory.  Between runs I quit 
Rev to minimize side-effects from the previous run.  Note that I did 
not run multiple series with the order of the tests changed, nor did I 
turn off networking, either of which could affect outcomes.  I did, 
however, include code outside of the timing to empty variables and give 
the system idle time for housekeeping, and I checked that all 
background apps were using zero or near-zero CPU time, so the results 
may not be too bad:


200 MB free RAM
---------------
Read all:      5.676 secs
Read by line: 87.298 secs
Read buffered: 7.017 secs


64 MB free RAM
--------------
Read all:      9.548 secs
Read by line: 92.606 secs
Read buffered: 7.601 secs


10 MB free RAM
--------------
Read all:     38.101 secs
Read by line: 94.309 secs
Read buffered: 7.734 secs


This test bears out the strengths and weaknesses of each method relative 
to the system configuration:

- If you have enough RAM for the data, reading the whole file at
   once can be a bit faster.

- If you don't have enough RAM for the whole file but there's enough
   for efficient swapping with VM, the buffered read was only faster
   by about 2 secs per 110 MB of data.

- And of course if you starve your system before running such an
   intensive routine (I had to open two 3D apps, GoLive, Photoshop,
   QuickTime Player, iTunes, and a bunch of others to bring
   available RAM down to 10 MB), reading the whole file still beats
   reading each line by a huge margin, but the VM swap gives the
   clear performance advantage to the buffered read.

- The buffered read method is the most scalable in terms of its
   relationship to available RAM, but less so in terms of data
   size: on small files the overhead of multiple disk accesses
   shows itself -- here are results on a 64k data file containing
   535 lines:

     Read all:      0.004 secs
     Read by line:  0.051 secs
     Read buffered: 0.011 secs

   I would guess that an optimized buffering routine would bring its
   results for small files more on par with the "read all" method,
   though not likely surpass it.

While fairly unscientific, the test was cost-effective in terms of 
letting me get back to work quickly.  As with the Geometry Manager 
issue, it took longer to run the tests and write about them here than it 
did to write the script. ;)  Carpe diem.

If any of you have time to improve the buffering method below, I'd be 
interested in any significant changes to your test results.  One 
untested idea as a starting point:
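
The "delete line 1" step in my method 3 rescans the buffer for every 
line it removes.  A variant that peels off only the possibly-incomplete 
last line of each chunk and then walks the rest with "repeat for each" 
should cut that overhead considerably.  A sketch only -- I haven't 
benchmarked it:

    -- hypothetical replacement for the Method 3 loop below
    put 0 into tWordCount3
    open file tFile for text read
    put empty into tBuffer
    repeat
      read from file tFile for 32000
      put tBuffer & it into tChunk
      if it is empty then
        -- EOF: everything left in the chunk is complete lines
        repeat for each line tLine in tChunk
          add the number of words of tLine to tWordCount3
        end repeat
        exit repeat
      end if
      -- hold back the (possibly incomplete) last line for the next pass
      put the number of lines of tChunk into tNumLines
      put line tNumLines of tChunk into tBuffer
      delete line tNumLines of tChunk
      repeat for each line tLine in tChunk
        add the number of words of tLine to tWordCount3
      end repeat
    end repeat
    close file tFile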

--
  Richard Gaskin
  Fourth World Media Corporation
  __________________________________________________
  Rev tools and more: http://www.fourthworld.com/rev

--------------------------------------------------------
--
-- Script for generating random dummy data:
--
on mouseUp
   ask file "Name new test file:" with "TestData.txt"
   if it is empty then exit to top
   put it into tFile
   --
   open file tFile for text write
   repeat with i = 1 to 100000 -- lines
     if i mod 1000 = 0 then
       put i
       set the cursor to busy
     end if
     --
     put empty into tLine
     repeat random(100)+50 -- words, between 51 and 150 per line
       put empty into tWord
       repeat random(20) -- chars, between 1 and 20 per word
         put "x" after tWord
       end repeat
       put tWord & space after tLine
     end repeat
     write tLine & cr to file tFile
   end repeat
   close file tFile
end mouseUp


---------------------------------------------------------
--
-- Script for benchmarking three different methods
-- of reading a large file:

on mouseUp
   answer file "Select a text file:"
   if it is empty then exit to top
   put it into tFile
   --
   --
   -- Method 1: read all
   --
   put "Doing method 1..."
   get empty -- clear "it"
   wait 2 secs with messages -- give system some idle time
   put the millisecs into t
   --
   put 0 into tWordCount1
   put url ("file:"&tFile) into tVar
   repeat for each line tLine in tVar
     add the number of words of tLine to tWordCount1
   end repeat
   --
   put the millisecs - t into t1
   --
   --
   -- Method 2: read by line
   --
   put "Doing method 2..."
   put empty into tVar -- clear var
   get empty
   wait 2 secs with messages
   put the millisecs into t
   --
   put 0 into tWordCount2
   open file tFile for text read
   repeat
     read from file tFile until cr
     if it is empty then exit repeat
     add the number of words of it to tWordCount2
   end repeat
   close file tFile
   --
   put the millisecs - t into t2
   --
   --
   -- Method 3: buffered read
   --
   put "Doing method 3..."
   get empty
   wait 2 secs with messages
   put the millisecs into t
   --
   put 0 into tWordCount3
   open file tFile for text read
   put empty into tBuffer
   repeat
     read from file tFile for 32000
     if it is empty then exit repeat
     put it after tBuffer
     --
     if length(it) < 32000 then -- EOF: all lines in the buffer are complete
       put the number of lines of tBuffer into tNumLines
     else
       -- the last line may be incomplete, so leave it for the next pass
       put the number of lines of tBuffer - 1 into tNumLines
     end if
     repeat tNumLines
       get line 1 of tBuffer
       add the number of words of it to tWordCount3
       delete line 1 of tBuffer
     end repeat
   end repeat
   -- catch any complete lines still in the buffer (this can happen
   -- when the file size is an exact multiple of the 32000-char chunk)
   repeat for each line tLine in tBuffer
     add the number of words of tLine to tWordCount3
   end repeat
   close file tFile
   --
   put the millisecs - t into t3
   --
   --
   put "Read all: "&t1/1000 &" secs    Read by line: "&t2/1000&\
       " secs    Read buffered: "&t3/1000 &" secs" \
       &cr& tWordCount1 && tWordCount2 && tWordCount3

end mouseUp



