Reading a (BIG) text file one line at a time - in reality...
Richard Gaskin
ambassador at fourthworld.com
Tue Nov 23 23:17:56 EST 2004
MisterX wrote:
> Technology may have advanced in OSes since the Apple ][ I started with, but
> VM is still a disk-intensive task and more penalizing than you think
> if not done right! Also remember that disk access is slower than RAM, and
> let's not talk about DMA/CPU L1/CPU L2 counts of cache hits and misses... But...
>
> Still, a buffered read avoids any paging of memory (which goes back and forth
> to disk), and this saves about 90% of the time processing lines of a file.
As with most things in computing, it may be that the most accurate
answer is "depends". ;) There are few absolutes in computing, with most
solutions lying somewhere among the trade-offs between RAM and performance.
True, a buffered read is _generally_ faster, and certainly the most
consistent across varied systems. But if this is a one-time project,
are the time savings in execution worth the time spent scripting the
buffered-read routine? Is it a consumer app where you'd have no idea
about the system configuration, or an internal tool you can run on the
system of your choice? Will most data files be this large, or is this
rare in most day-to-day uses of the app?
These and other considerations may affect outcomes, as we'll see in the
benchmark results below.
> Again, I have a fully working free source code example...
>
> And a blocking bug...
Since reading files is unrelated to the Geometry Manager issue (which
can be worked around anyway in far less time than waiting for a fix, as
discussed earlier), there was nothing to stop me from writing a quick
benchmark script to try this out.
Before I could run the test I needed some data to test with, so I made
a file with 100,000 lines, each containing a random number of words of
random length, totaling 110.3 MB when generated (the script used to
generate it is copied below).
Then I wrote a quickie benchmark script which runs the three
file-reading methods in question here:
Method 1 - Read all: Uses "put url..." to read the whole file in
one move, then uses "repeat for each" to
process lines.
Method 2 - Read by line:  Uses "read from file...until cr", which is
the most convenient but of course the slowest
since it needs to compare each incoming character
as it reads.
Method 3 - Buffered: Uses an off-the-cuff buffering algorithm to grab
data from disk in 32k chunks, then finds lines
within the buffer.
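In outline (the full scripts are copied at the end of this post), the
core of each method looks like this:

   -- Method 1: one read, then "repeat for each"
   put url ("file:" & tFile) into tVar
   repeat for each line tLine in tVar
     -- process tLine here
   end repeat

   -- Method 2: let the engine find each line ending
   read from file tFile until cr -- the line arrives in "it"

   -- Method 3: grab fixed-size chunks, split out the lines yourself
   read from file tFile for 32000
   put it after tBuffer -- then process the complete lines in tBuffer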
Since the original question was about processing lines, the goal of all
three methods was to do something with the contents of each line, so I
arbitrarily chose to count the words in each line and add them to a
total. While a meaningless task in itself, it allowed me to verify the
accuracy of the methods by comparing the resulting word counts.
The benchmarking script is also copied below, and was run on a
PowerBook G4 with a single 1 GHz processor and 768 MB total built-in memory.
It should be noted that the buffered-read method I wrote is likely not
the fastest possible, but was thrown together quickly to get a
reasonable test result. I'm sure one of the folks here can improve it,
but given the overhead inherent in managing a buffer and multiple disk
accesses the results were roughly in line with what I would expect.
I ran the test three times, each time with more apps open to further
decrease the amount of available physical memory. Between runs I quit
Rev to minimize side-effects from the previous run. Note that I did
not run multiple series with the order of the tests changed, nor did I
turn off networking, either of which may affect outcomes. I did,
however, include code outside of the timing to empty variables and
give the system idle time for housekeeping, and I checked that all
background apps were using zero or close to zero CPU time, so the
results may not be too bad:
200 MB free RAM
---------------
Read all: 5.676 secs
Read by line: 87.298 secs
Read buffered: 7.017 secs
64 MB free RAM
--------------
Read all: 9.548 secs
Read by line: 92.606 secs
Read buffered: 7.601 secs
10 MB free RAM
--------------
Read all: 38.101 secs
Read by line: 94.309 secs
Read buffered: 7.734 secs
This test bears out the strengths and weaknesses of each method relative
to the system configuration:
- If you have enough RAM for the data, reading the whole file at
once can be a bit faster.
- If you don't have enough RAM for the whole file but there's enough
   for efficient swapping with VM, the buffered read was only about
   2 seconds faster per 110 MB of data.
- And of course if you starve your system before running such an
intensive routine (I had to open two 3D apps, GoLive, PhotoShop,
QuickTime Player, iTunes, and a bunch of others to bring
available RAM down to 10MB), reading the whole file still beats
reading each line by a huge margin but the VM swap gives the
clear performance advantage to the buffered read.
- The buffered read method is the most scalable in terms of its
relationship to available RAM, but less so in terms of data
size: on small files the overhead of multiple disk accesses
shows itself -- here are results on a 64k data file containing
535 lines:
Read all: 0.004 secs
Read by line: 0.051 secs
Read buffered: 0.011 secs
     I would guess that an optimized buffering routine would bring its
     results for small files more on par with the "read all" method,
     but not likely surpass it (one possible direction for such an
     optimization is sketched below).
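For what it's worth, here is one direction such an optimization might
take. Consider it an untested sketch: rather than deleting line 1 of
the buffer over and over, which makes the engine rescan the buffer on
every pass, it holds back the possibly-incomplete last line and hands
the rest to "repeat for each", which walks the buffer in a single pass:

  -- Sketch of an optimized buffered read (not benchmarked):
  on mouseUp
    answer file "Select a text file:"
    if it is empty then exit to top
    put it into tFile
    --
    put 0 into tWordCount
    put empty into tBuffer
    open file tFile for text read
    repeat
      read from file tFile for 32000
      if it is empty then
        -- EOF: count whatever is still in the buffer
        repeat for each line tLine in tBuffer
          add the number of words of tLine to tWordCount
        end repeat
        exit repeat
      end if
      put it after tBuffer
      -- hold back the last line, which may be incomplete
      put the number of lines of tBuffer into tNumLines
      put line tNumLines of tBuffer into tTail
      delete line tNumLines of tBuffer
      repeat for each line tLine in tBuffer
        add the number of words of tLine to tWordCount
      end repeat
      put tTail into tBuffer
    end repeat
    close file tFile
    put tWordCount
  end mouseUp

Whether that closes the gap on small files I haven't measured, so the
guess above stands until someone runs the numbers.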
While fairly unscientific, the test was cost-effective in terms of
letting me get back to work quickly. As with the Geometry Manager
issue, it took longer to run the tests and write about them here than it
did to write the script. ;) Carpe diem.
If any of you have time to improve the buffering method below I'd be
interested in any significant changes to your test results.
--
Richard Gaskin
Fourth World Media Corporation
__________________________________________________
Rev tools and more: http://www.fourthworld.com/rev
--------------------------------------------------------
--
-- Script for generating random dummy data:
--
on mouseUp
ask file "Name new test file:" with "TestData.txt"
if it is empty then exit to top
put it into tFile
--
open file tFile for text write
repeat with i = 1 to 100000 -- lines
if i mod 1000 = 0 then
put i
set the cursor to busy
end if
--
put empty into tLine
    repeat random(100)+50 -- words, between 51 and 150 per line
put empty into tWord
repeat random(20) -- chars, between 1 and 20 per word
put "x" after tWord
end repeat
put tWord & space after tLine
end repeat
write tLine & cr to file tFile
end repeat
close file tFile
end mouseUp
---------------------------------------------------------
--
-- Script for benchmarking three different methods
-- of reading a large file:
on mouseUp
answer file "Select a text file:"
if it is empty then exit to top
put it into tFile
--
--
-- Method 1: read all
--
put "Doing method 1..."
get empty -- clear "it"
wait 2 secs with messages -- give system some idle time
put the millisecs into t
--
put 0 into tWordCount1
put url ("file:"&tFile) into tVar
repeat for each line tLine in tVar
add the number of words of tLine to tWordCount1
end repeat
--
put the millisecs - t into t1
--
--
-- Method 2: read by line
--
put "Doing method 2..."
put empty into tVar -- clear var
get empty
wait 2 secs with messages
put the millisecs into t
--
put 0 into tWordCount2
open file tFile for text read
repeat
read from file tFile until cr
if it is empty then exit repeat
add the number of words of it to tWordCount2
end repeat
close file tFile
--
put the millisecs - t into t2
--
--
-- Method 3: buffered read
--
put "Doing method 3..."
get empty
wait 2 secs with messages
put the millisecs into t
--
put 0 into tWordCount3
open file tFile for text read
put empty into tBuffer
  repeat
    read from file tFile for 32000
    if it is empty then
      -- EOF: if the file length was an exact multiple of the chunk
      -- size, one last complete line is still in the buffer
      if tBuffer is not empty then
        add the number of words of tBuffer to tWordCount3
      end if
      exit repeat
    end if
    put it after tBuffer
    --
    if length(it) < 32000 then -- short read: we've hit EOF
      put the number of lines of tBuffer into tNumLines
    else
      -- last line may be incomplete, so hold it back
      put the number of lines of tBuffer - 1 into tNumLines
    end if
    repeat tNumLines
      get line 1 of tBuffer
      add the number of words of it to tWordCount3
      delete line 1 of tBuffer
    end repeat
  end repeat
--
put the millisecs - t into t3
--
--
put "Read all: "&t1/1000 &" secs Read by line: "&t2/1000&\
" secs Read buffered: "&t3/1000 &" secs" \
&cr& tWordCount1 && tWordCount2 && tWordCount3
end mouseUp