Reading a (BIG) text file one line at a time - in reality...

MisterX b.xavier at internet.lu
Tue Nov 23 19:37:34 EST 2004


Richard, everyone...

If I can give you a tip, it's this one... 
It's about "basic" big file parsing...

Operating systems may have advanced since the Apple ][ I started with, but
virtual memory is still a disk-intensive mechanism, and more penalizing than
you think if it's not used right! Also remember that disk access is far slower
than RAM, and let's not even talk about DMA and L1/L2 CPU cache hits and
misses... But...

Still, a buffered read avoids any paging of memory (which goes back and forth
to disk), and in my tests this saves about 90% of the time spent processing
the lines of a file.

I've tried before to read a HUGOMONGOUS file and then parse it. Forget it! 

Time after time after time, the buffer is faster!

Why?

Pointer travel, both in RAM and across any file bigger than your memory, will
be slower than accessing one small memory block over and over. I'll give you a
truck metaphor (with a twist) after...

consider "basic" buffered reads 

open file tFile for read
repeat
  read from file tFile for 32000
  -- note: a 32000-char chunk can end mid-line; the fuller
  -- sketch below carries the partial tail over to the next pass
  repeat for each line l in it
    -- do something to this line
  end repeat
  -- the engine puts "eof" in the result on the last chunk
  if the result is "eof" then exit repeat
end repeat
close file tFile

This simply reads from the current position of the file pointer (with exactly
the same head movement as if you read the whole file) into a fresh, small
memory buffer, within which the line-to-line pointer moves are tiny. So there
is far less pointer movement to be done overall.
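Here is a fuller sketch of that loop in Revolution script; a minimal sketch
under my own assumptions, where tPath and processLine are names I made up, and
the held-back tail handles the fact that a 32000-character chunk can stop in
the middle of a line:

on parseBigFile tPath
  local tBuffer, tLeftover
  put empty into tLeftover
  open file tPath for text read -- text mode normalizes line endings
  repeat
    read from file tPath for 32000
    put tLeftover & it into tBuffer
    -- the engine puts "eof" in the result on the last chunk
    if the result is "eof" then exit repeat
    -- a chunk can end mid-line: hold the incomplete tail back
    if the last char of tBuffer is return then
      put empty into tLeftover
    else
      put the last line of tBuffer into tLeftover
      delete the last line of tBuffer
    end if
    repeat for each line tLine in tBuffer
      processLine tLine -- your per-line work goes here
    end repeat
  end repeat
  -- the final chunk (plus any held-back tail) is still in tBuffer
  repeat for each line tLine in tBuffer
    processLine tLine
  end repeat
  close file tPath
end parseBigFile

Call it as parseBigFile "bigfile.txt" and supply your own processLine handler.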

So the read part costs almost the same in terms of file access speed. So does
the write...

If you do read the whole file into memory, and we already know RunRev is a
major CPU hog, then you have to parse a huge variable with a moving pointer.
That might not exceed RunRev's limits, but it requires a lot of paging of
memory between disk and RAM... If you are on an enterprise SAN behind
firewalls, this is even slower (if you don't know what a SAN is, imagine
terabytes of industrial data with mirrors, backups, and all of it available
100% of the time across the world; no downtime permitted!). Now, let's talk
about cache hits and misses... Nah, let me give you a metaphor...

You have a gigaton of gravel to send to your factory in Timbuktu for
processing into bricks or whatever object turns your clients on...

You either have one humongous truck that takes the gigaton load in one trip
but drives at 3 miles an hour all the way to Timbuktu... Naturally, this truck
costs you a Formula 1 team's budget to keep in running condition!

Or you have a small fleet of "normal", faster, cheaper trucks to carry the
load in parts (the buffer analogy). Normal trucks are cheaper, and more
importantly, you need fewer of them! When one is done unloading, it can go
back and get more... Reuse of resources... Besides, the factory can only
process one truckload at a time (RunRev can't thread!)...

So, given that RunRev can't execute in threads, it's pointless to try (even
where possible) to load all the data at once! The time it takes to "process"
it only grows longer and longer!

Consider that when you do

put line x of thisDataChunk into tLine

RunRev travels (recounts line delimiters) from byte 0 to wherever line x
starts, "each" time! This is basic C string handling...
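
To feel the difference, compare the two loop styles (a sketch; the cost
comments are my own reasoning, not measurements):

-- indexed access: the engine recounts lines from the start of
-- thisDataChunk on every pass, so the loop cost grows quadratically
repeat with x = 1 to the number of lines in thisDataChunk
  put line x of thisDataChunk into tLine
  -- do something to tLine
end repeat

-- "repeat for each" keeps a running position instead: one pass
repeat for each line tLine in thisDataChunk
  -- do something to tLine
end repeat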

Of course you can do it by taking line 1, deleting line 1, and making the
variable smaller each time (sketched below), but each delete moves a chunk of
RAM or creates memory changes the VM may have to adjust for... Cache hits and
misses again...
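
In code, that shrinking-variable version is something like this (a sketch of
what NOT to do; thisDataChunk is the whole file in memory):

repeat until thisDataChunk is empty
  put line 1 of thisDataChunk into tLine
  delete line 1 of thisDataChunk -- shifts all remaining data in RAM
  -- do something to tLine
end repeat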

If it's a 32K buffer, it's easy and fast: fewer lines to count, time after
time (and the file pointer doesn't travel from 0 to a possible 1 GB each pass;
it just moves 32 KB forward, and the line count starts over from zero). No VM
swaps, no cache thrashing; just read, process, and write!

In the case of a 1 GB file read whole, the pointer makes the entire journey
while the VM jogs off to find disk space to swap to, writes to it, verifies
it, tells your program to continue, and on and on... It may seem fast on
today's computers, but it's definitely NOT effective or economical by any
measure...

It is a good thing that RunRev can do that!

Think it through... Try it...

A compressor (whichever one) also does its reading in buffers... It never
loads the whole file into memory; it just reads its dictionary and the file
buffer, translates, then writes. All in buffers. Small buffers mean many reads
(disk is slow, so that should be minimized), yet they all fit in whatever RAM
the computer can spare; usually not tiny, but not the biggest buffer possible
either, or you'd block the whole computer while you process!
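
The same pattern in Revolution script might look like this; a minimal sketch,
assuming tSrc and tDst hold file paths, with the built-in toUpper standing in
for whatever translation the real tool does per buffer:

on bufferedTranslate tSrc, tDst
  open file tSrc for binary read
  open file tDst for binary write
  repeat
    read from file tSrc for 32768
    if it is empty then exit repeat
    write toUpper(it) to file tDst -- translate, then write
    if the result is "eof" then exit repeat
  end repeat
  close file tSrc
  close file tDst
end bufferedTranslate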

If you use a disk-block-sized buffer and align the writes to blocks on the
disk, as you would in assembler, it would be optimal... Still, disk
fragmentation might stop you. Hence the good old tip of putting your swap file
on a dedicated partition...

I'll finish with these questions I've kept running into...

In UltraEdit or BBEdit (both very capable hex or any-file editors), why does
it take a huge amount of time to open a big file when buffering is turned off
(a choice in the prefs)? Why does scrolling also take forever in big files?
Is a big problem solved faster as a whole or in parts? Whatever the logic for
big things, parts go faster than wholes!

It's past 1:30 AM and past my bedtime...

Again, I have a fully working, free source-code example...

And a blocking bug...

RunRev has since delegated my support questions, which is excellent news;
here's hoping it unblocks my situation... Thanks

X

> -----Original Message-----
> From: use-revolution-bounces at lists.runrev.com 
> [mailto:use-revolution-bounces at lists.runrev.com] On Behalf Of 
> Richard Gaskin
> Sent: Tuesday, November 23, 2004 18:13
> To: How to use Revolution
> Subject: Re: Reading a (BIG) text file one line at a time
> 
> Rob Beynon wrote:
> > Greetings all,
> > I have failed to discover how to read a file one line at a 
> time. The 
> > file is a text file, and is large (84MB) but I need to 
> process it on a 
> > line by line basis, rather than read the whole file into a 
> field (not 
> > even sure I can!).
> > 
> > I thought of
> > 
> > read from myFile until EOF for one line
> > 
> > but that doesn't work
> > 
> > Help would be greatly appreciated here! Would you be good enough to 
> > mail to my work address, as I have yet to work out how to 
> access the 
> > list effectively
> 
> Your post made it here, so I'm assuming you worked that out. :)
> 
> The above doesn't work only because you're asking Rev to do 
> two different things, to read until the end of the file AND 
> to read until the end of the next line.  You gotta choose 
> which one you want.  If you want to read the next line use:
> 
>    read from file MyFile until cr
> 
> That assumes the file was opened for text read, causing the 
> engine to automatically translate line endings from 
> platform-specific conventions to the Unix standard 
> line-ending ASCII 10.
> 
> If your file is opened as binary and your line-endings use the 
> Windows convention, use:
> 
>    read from file MyFile until CRLF -- constant for Win line-ending
> 
> But have you tested reading the entire file?  It may sound 
> crazy, and depending on what you want to do with the data it 
> may not be optimal, but I've successfully read much larger 
> files without difficulty in Rev.
> 
> Big files can be slow, depending on available RAM, but in my 
> experience the only platform on which it's a show-stopper is 
> Mac Classic; OS X, Linux, and XP have very efficient memory 
> systems which allow some amazing things if you're used to 
> Classic's limitations.
> 
> Not long ago I had a customer send me a 250MB Gzipped file, 
> and I used it as a test case for Rev's built-in decompress 
> function -- it took a minute or so, but the entire file was 
> successfully decompressed to its original 580MB glory and 
> written back out to disk without error.  When you consider 
> that the decompress function requires both the gzipped and 
> decompressed data to be in memory at the same time, along 
> with everything else the engine and the IDE needs, that's 
> pretty impressive.
> 
> The system I ran that on has only 1GB physical RAM, and I had 
> a few other apps open.  Thank goodness for modern VM systems. :)
> 
> --
>   Richard Gaskin
>   Fourth World Media Corporation
>   __________________________________________________
>   Rev tools and more: http://www.fourthworld.com/rev 
> _______________________________________________
> use-revolution mailing list
> use-revolution at lists.runrev.com
> http://lists.runrev.com/mailman/listinfo/use-revolution
> 


