Reading a (BIG) text file one line at a time

Gregory Lypny gregory.lypny at videotron.ca
Wed Nov 24 09:51:50 EST 2004


Hello everyone,

I've benefited immeasurably from the thoughtful comments of everyone on 
this list, so I'm throwing in my two cents for you to critique.  The 
following is a primitive handler (because I'm a primitive scripter) I 
wrote in MetaCard a couple of years ago for reading and processing 
large text files.

The project was to index Unigene data files.  These are flat-file database 
files from the Human Genome Project, roughly 450 to 500 MB each.  I processed 
a file simply by reading it one record at a time.  Records vary from a few 
lines to many hundreds of lines, with an unknown number of variables in each 
record, and are delimited by ">>" or "//"; the variable recordDelimiter is 
set accordingly.  The data from each read is put into thisRecord and 
processed.  Processing involves breaking each record down into its variables 
and writing them to separate index files, effectively creating a relational 
database.
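
Roughly, the splitting step might look something like the sketch below.  The 
handler name, the gFieldBuffer array, and the assumption that every line of a 
record begins with a field tag followed by its value are just for 
illustration; they are not lifted from my actual script.

  -- Sketch only: split one record into its variables and collect each
  -- field's data in a buffer keyed by field name.
  on processRecord thisRecord
    global gFieldBuffer  -- array of pending index lines, one element per field
    put empty into recordID
    repeat for each line thisLine in thisRecord
      put word 1 of thisLine into fieldName
      if fieldName is empty or fieldName is "//" then next repeat
      put word 2 to -1 of thisLine into fieldValue
      if fieldName is "ID" then put fieldValue into recordID
      -- accumulate "ID tab value" for this field's index file
      put recordID & tab & fieldValue & return after gFieldBuffer[fieldName]
    end repeat
  end processRecord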

What made a big difference in processing speed was how often the variables in 
which processed data accumulated were written out to the index files on disk 
and then emptied before continuing.  Likewise, when searching and extracting 
information from the resulting index files, speed improved when I experimented 
with the amount of data read as a search progressed.  Perhaps, in your case, 
it is best to read in many lines at a time and process them all together.  It 
would not be too difficult to write a script that adjusts the number of lines 
it reads and processes as it goes along in order to find an optimum: reading 
and writing too often reduces speed, but so does processing too much data at 
a time.  I haven't had a chance to study Richard, Jacqueline, and Xavier's 
posts on buffers, but I'm guessing they deal with the optimal use of memory 
in this regard.
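
To give the flavour of the flush step, here is a rough sketch; the handler 
name, the file layout, and when to call it are assumptions for illustration 
only.  In the main loop it might be called every few thousand records, or 
whenever the combined size of the buffers passes a few megabytes.  For 
line-oriented data the read itself can also be batched, e.g. "read from file 
filePath for 5000 lines", and the batch size tuned empirically.

  -- Sketch only: append the buffered index data to the index files on
  -- disk, then empty the buffers before continuing.
  on flushBuffers indexFolder
    global gFieldBuffer
    repeat for each line fieldName in the keys of gFieldBuffer
      put indexFolder & "/" & fieldName & ".txt" into indexFile
      open file indexFile for append
      write gFieldBuffer[fieldName] to file indexFile
      close file indexFile
      put empty into gFieldBuffer[fieldName]
    end repeat
  end flushBuffers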

Below the handler is an example of one of my logs from indexing that 
was done on a modest 300 MHz G3 iBook.  Processing a 483 MB file took 
29 minutes.  This may seem like a long time, but it only had to be done 
once.  After that, searches are done on the index files.  Scroll down 
and you'll see another log, this one from one of the searches on the index 
files.  A search for 2,065 genetic probes, and the merging and summarizing of 
related data from the index files, took 1.3 minutes.  This isn't too 
shabby considering that most scientific web sites either restrict users 
to one query at a time (imagine submitting 2,065 queries), or, if they 
permit batch queries, there is little choice in the format of output or 
merge files.  This is all to say that we're pleased.


  open file filePath for read
  repeat
    read from file filePath until recordDelimiter
    put the result into resultOfRead
    put it into thisRecord
    -- START PROCESSING AS DESIRED
    -- Script for processing thisRecord goes here.
    -- Monitor the variables or containers in which you allow data to accumulate,
    -- so that they don't get too big and slow down performance.  Experiment to
    -- find the optimal number of times to write accumulated data to disk files.
    if resultOfRead is not empty then exit repeat  -- we're at end-of-file (eof)
  end repeat
  close file filePath
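
For line-oriented files, the same loop can be written to read a fixed number 
of lines per pass instead of reading to a delimiter.  The sketch below is a 
variant along those lines; the handler name and the default of 5,000 lines 
per read are arbitrary starting points to be tuned as described above, not 
values I have tested.

  -- Sketch only: read a batch of lines per pass and process them together,
  -- flushing the accumulators (see the earlier sketch) after each batch.
  on indexByLines filePath, indexFolder, linesPerRead
    if linesPerRead is empty then put 5000 into linesPerRead
    open file filePath for read
    repeat
      read from file filePath for linesPerRead lines
      put the result into resultOfRead
      put it into theseLines
      repeat for each line thisLine in theseLines
        -- process each line (or regroup lines into records) here
      end repeat
      flushBuffers indexFolder
      if resultOfRead is not empty then exit repeat  -- "eof" reached
    end repeat
    close file filePath
  end indexByLines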

Here is a sample log from one of the runs to index the source file.
Index created: Fri, Oct 24, 2003 11:44:04 AM
Target file: "UniGene Human 23 Oct 2003.txt"
Index set name (folder): "UniGene Human 23 Oct 2003 Index"
Index key (first column of every index file): "ID"
Number of records in source: 127,835
Number of variables found: 12
Total indexing time: about 28.9 minutes
Data processed: 483.01 MB
Indexing performance: 279.02 KB per second

Here is a sample log from one of the searches run on the index files 
that were created.
Your data file: "Human Probes 2065 x 68.txt"
Target file: "Consolidated SEQUENCE.txt"
Target set (folder): "UniGene Human 23 Oct 2003 Index"
Extraction set (folder): "Human Extraction"
Observations (lines) searched: 127,835
Hits: 1,631 (chunks found)
Misses: 360 (chunks not found)
Total unique chunks: 1,991
Fuzzy hits: not flagged for extractions from consolidated files
Records extracted: 1,562 (records where chunks were found)
Times data file read: 43
Average read size: 1.03 MB
Average lines per read: 2,973
Total bytes read: 44.34 MB
Extraction time: about 1.3 minutes

	Regards,

	Gregory Lypny
	__________________________
	Associate Professor of Finance
	John Molson School of Business
	Concordia University
	Montreal, Canada


