Reading a (BIG) text file one line at a time
Gregory Lypny
gregory.lypny at videotron.ca
Wed Nov 24 09:51:50 EST 2004
Hello everyone,
I've benefited immeasurably from the thoughtful comments of everyone on
this list, so I'm throwing in my two cents for you to critique. The
following is a primitive handler (because I'm a primitive scripter) I
wrote in MetaCard a couple of years ago for reading and processing
large text files.
The project was to index Unigene data files. These are flat-file
database files from the Human Genome project, and they are roughly 450
to 500 MB. I processed a file simply by reading it one record at a
time. Records vary from a few lines to many hundreds of
lines, with an unknown number of variables in each record, and are
delimited by ">>" or "//". The variable recordDelimiter is set
accordingly. The data from the read is put into thisRecord and
processed. Processing involves breaking down each record into its
variables and writing them to separate index files, effectively
creating a relational database.
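To give a rough idea of what that processing step looks like, here is a
sketch of a handler that splits one record into its variables and
appends each to its own index file. The layout it assumes (a one-word
tag at the start of each line, with the record's ID on the first line)
and the folder path are made up for illustration; the real Unigene
format differs in its details.

on indexOneRecord thisRecord, indexFolder
  -- Sketch only: assume the record's ID is on its first line
  put word 2 of line 1 of thisRecord into recordID
  repeat for each line thisLine in thisRecord
    put word 1 of thisLine into theTag -- e.g. "ID", "TITLE", "SEQUENCE"
    put word 2 to -1 of thisLine into theValue -- the rest of the line
    -- one index file per variable; the first column is always the ID key
    put indexFolder & "/" & theTag & ".txt" into indexPath
    open file indexPath for append
    write recordID & tab & theValue & return to file indexPath
    close file indexPath
  end repeat
end indexOneRecord

In practice you would not open and close the index files on every line;
you would let the output accumulate in variables and flush them
periodically, which is the point of the next paragraph.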
What made a big difference in processing speed was how frequently the
variables in which processed data accumulated were written to index
files on disk and then emptied before continuing. Likewise, when
searching and extracting information from the resulting index files,
speed improved by experimenting with the amount of data read as a
search progressed. Perhaps, in your case, it is best to read in many
lines at a time and process them all together. It would not be too
difficult to write a script that adjusts the number of lines it reads
in and processes as it goes along until it finds an optimum. Reading
and writing too often reduces speed, but so does processing too much
data at a time. I haven't had an opportunity to study Richard's,
Jacqueline's, and Xavier's posts on buffers, but I'm guessing that
they deal with the optimal use of memory in this regard.
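In script form, the accumulate-and-flush idea amounts to something like
the lines below. The container names (processedOutput, accumulatedData,
outPath) and the 4 MB threshold are made up for illustration; the right
threshold is something to find by timing runs.

  -- inside the processing loop: gather output, write it out only
  -- when it passes a size threshold, then empty the container
  put processedOutput after accumulatedData
  if the length of accumulatedData > 4000000 then -- 4 MB: arbitrary, tune it
    open file outPath for append
    write accumulatedData to file outPath
    close file outPath
    put empty into accumulatedData -- empty the container before continuing
  end if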
Below the handler is an example of one of my logs from indexing that
was done on a modest 300 MHz G3 iBook. Processing a 483 MB file took
29 minutes. This may seem like a long time, but it only had to be done
once. After that, searches are done on the index files. Scroll down
and you'll see another log: this one from one of the searches on index
files. A search for 2,065 genetic probes, together with the merging
and summarizing of related data from the index files, took 1.3
minutes. This isn't too
shabby considering that most scientific web sites either restrict users
to one query at a time (imagine submitting 2,065 queries), or, if they
permit batch queries, there is little choice in the format of output or
merge files. This is all to say that we're pleased.
open file filePath for read
repeat
  read from file filePath until recordDelimiter
  put the result into resultOfRead
  put it into thisRecord
  -- START PROCESSING AS DESIRED
  -- Script for processing thisRecord goes here.
  -- Monitor the variables or containers in which you allow data to accumulate,
  -- so that they don't get too big and slow down performance. Experiment to
  -- find the optimal number of times to write accumulated data to disk files.
  if resultOfRead is not empty then exit repeat -- we're at the end-of-file (eof)
end repeat
close file filePath
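If it turns out to be better to read a block of lines per pass than one
delimited record, the same loop can be written this way; 5,000 lines
per read is only a starting point to adjust while timing the runs, as
suggested above.

open file filePath for read
repeat
  read from file filePath for 5000 lines -- 5000 is arbitrary; adjust and time it
  put the result into resultOfRead
  put it into theseLines
  -- process theseLines here, accumulating and flushing as described earlier
  if resultOfRead is not empty then exit repeat -- "eof" at the end of the file
end repeat
close file filePath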
Here is a sample log from one of the runs to index the source file.
Index created: Fri, Oct 24, 2003 11:44:04 AM
Target file: "UniGene Human 23 Oct 2003.txt"
Index set name (folder): "UniGene Human 23 Oct 2003 Index"
Index key (first column of every index file): "ID"
Number of records in source: 127,835
Number of variables found: 12
Total indexing time: about 28.9 minutes
Data processed: 483.01 MB
Indexing performance: 279.02 KB per second
Here is a sample log from one of the searches run on the index files
that were created.
Your data file: "Human Probes 2065 x 68.txt"
Target file: "Consolidated SEQUENCE.txt"
Target set (folder): "UniGene Human 23 Oct 2003 Index"
Extraction set (folder): "Human Extraction"
Observations (lines) searched: 127,835
Hits: 1,631 (chunks found)
Misses: 360 (chunks not found)
Total unique chunks: 1,991
Fuzzy hits: not flagged for extractions from consolidated files
Records extracted: 1,562 (records where chunks were found)
Times data file read: 43
Average read size: 1.03 MB
Average lines per read: 2,973
Total bytes read: 44.34 MB
Extraction time: about 1.3 minutes
Regards,
Gregory Lypny
__________________________
Associate Professor of Finance
John Molson School of Business
Concordia University
Montreal, Canada