Reading a (BIG) text file one line at a time

xbury.cs at clearstream.com
Wed Nov 24 10:06:53 EST 2004


Gregory...

Excellent tip regarding optimizing...

This is quite akin to genetic algorithms (GA) in AI, where a 
change/mutation is induced in the factors that drive the function to 
find the best combination among many. 

Your approach has a slight problem that GA avoids. 

If you find a better factor between two worse factors (imagine an 
irregular sine wave), you may have the best factor between two points 
on the curve but not necessarily the best point on the whole curve: a 
local optimum rather than the global one. Given more than one factor 
(here, the buffer size is only one), this easily becomes an exhaustive 
search for the highest mountain in a 3D mountain graph.

And a GA usually never satisfies itself but keeps mutating one factor 
or another in case there is a better combination. 

Just a thought you induced with the genome search and optimizing the 
buffer size!
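
For what it's worth, here is a toy sketch of that mutation idea in 
Transcript. It is only a hedged illustration: timeOneRun is a 
hypothetical function you would write yourself to read the test file 
with a given buffer size and return the elapsed milliseconds.

function findBestBufferSize startSize
  -- keep the best size seen so far, then repeatedly mutate it
  put startSize into bestSize
  put timeOneRun(bestSize) into bestTime  -- timeOneRun is hypothetical
  repeat 50 times  -- fixed mutation budget
    -- mutate: scale the current best by a random factor from 0.51 to 2
    put round(bestSize * (random(150) + 50) / 100) into trySize
    put timeOneRun(trySize) into tryTime
    if tryTime < bestTime then
      put trySize into bestSize
      put tryTime into bestTime
    end if
  end repeat
  return bestSize
end findBestBufferSize

Because each mutation starts from the current best, this can still get 
stuck on a local maximum; that is exactly what a real GA's population 
and crossover are meant to soften.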

GAs in RunRev are probably overkill. Like neural networks, they are 
better delegated to an external or separate application via shell 
calls or AppleScript.

cheers
Xavier

On 24.11.2004 15:51:50 use-revolution-bounces wrote:
>Hello everyone,
>
>I've benefited immeasurably from the thoughtful comments of everyone on
>this list, so I'm throwing in my two cents for you to critique.  The
>following is a primitive handler (because I'm a primitive scripter) I
>wrote in MetaCard a couple of years ago for reading and processing
>large text files.
>
>The project was to index Unigene data files.  These are flat-file
>database files from the Human Genome project, and they are roughly 450
>to 500 MB.  The way I processed a file is simply by reading it in a
>record at a time.  Records vary from a few lines to many hundreds of
>lines, with an unknown number of variables in each record, and are
>delimited by ">>" or "//".  The variable recordDelimiter is set
>accordingly.  The data from the read is put into thisRecord and
>processed.  Processing involves breaking down each record into its
>variables and writing them to separate index files, effectively
>creating a relational database.
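>
>In sketch form, that breakdown step looks something like this (a
>simplified illustration, not the actual handler; it assumes each line
>of a record begins with a field tag such as ID or SEQUENCE, and
>recordID and indexData are illustrative names):
>
>put word 2 of line 1 of thisRecord into recordID  -- assumes line 1 is like "ID Hs.1234"
>repeat for each line thisLine in thisRecord
>  put word 1 of thisLine into fieldTag  -- the field tag, e.g. "SEQUENCE"
>  put word 2 to -1 of thisLine into fieldValue  -- the rest of the line
>  -- accumulate one "key tab value" row per field, keyed by the tag
>  put recordID & tab & fieldValue & return after indexData[fieldTag]
>end repeat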
>
>What made a big difference in processing speed was how frequently the
>variables in which processed data was accumulating were written to
>index files on disk and then emptied before continuing.
>Likewise, in searching and extracting information from the resulting
>index files, speed was enhanced by experimenting with the amount of
>data read as the search progressed.  Perhaps, in your case, it is best
>to read in many lines at a time and process them all together.  It
>would not be too difficult to write a script that adjusts the number of
>lines that it reads in and processes as it goes along to find an
>optimum.  Reading and writing too often reduces speed, but processing
>too much data at a time does as well.  I haven't had an opportunity to
>study
>Richard, Jacqueline, and Xavier's posts on buffers, but I'm guessing
>that they're dealing with the optimal use of memory in this regard.
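>
>In sketch form, the flush-and-empty pattern looks something like this
>(simplified; the 100,000-character threshold is arbitrary and worth
>experimenting with, and indexFolder and indexData are illustrative
>names):
>
>repeat for each line fieldTag in the keys of indexData
>  if the length of indexData[fieldTag] > 100000 then
>    put indexFolder & fieldTag & ".txt" into indexFilePath
>    open file indexFilePath for append
>    write indexData[fieldTag] to file indexFilePath
>    close file indexFilePath
>    put empty into indexData[fieldTag]  -- empty the accumulator
>  end if
>end repeat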
>
>Below the handler is an example of one of my logs from indexing that
>was done on a modest 300 MHz G3 iBook.  Processing a 483 MB file took
>29 minutes.  This may seem like a long time, but it only had to be done
>once.  After that, searches are done on the index files.  Scroll down
>and you'll see another log: this one from one of the searches on index
>files.  A search for 2,065 genetic probes and the merger and summary of
>related data from the index files took 1.3 minutes.  This isn't too
>shabby considering that most scientific web sites either restrict users
>to one query at a time (imagine submitting 2,065 queries), or, if they
>permit batch queries, there is little choice in the format of output or
>merge files.  This is all to say that we're pleased.
>
>
>open file filePath for read
>repeat
>  read from file filePath until recordDelimiter
>  put the result into resultOfRead
>  put it into thisRecord
>  -- START PROCESSING AS DESIRED
>  -- Script for processing thisRecord goes here.
>  -- Monitor the variables or containers in which you allow data to
>  -- accumulate, so that they don't get too big and slow down
>  -- performance. Experiment to find the optimal number of times to
>  -- write accumulated data to disk files.
>  if resultOfRead is not empty then exit repeat  -- we're at end-of-file (eof)
>end repeat
>close file filePath
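>
>To use the loop, set the two variables first; a minimal usage sketch
>(the prompt wording is just an example):
>
>put "//" into recordDelimiter  -- or ">>", depending on the source file
>answer file "Pick the flat file to index"
>put it into filePath  -- it is empty if the user cancelled
>-- then run the read loop above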
>
>Here is a sample log from one of the runs to index the source file.
>Index created: Fri, Oct 24, 2003 11:44:04 AM
>Target file: "UniGene Human 23 Oct 2003.txt"
>Index set name (folder): "UniGene Human 23 Oct 2003 Index"
>Index key (first column of every index file): "ID"
>Number of records in source: 127,835
>Number of variables found: 12
>Total indexing time: about 28.9 minutes
>Data processed: 483.01 MB
>Indexing performance: 279.02 KB per second
>
>Here is a sample log from one of the searches run on the index files
>that were created.
>Your data file: "Human Probes 2065 x 68.txt"
>Target file: "Consolidated SEQUENCE.txt"
>Target set (folder): "UniGene Human 23 Oct 2003 Index"
>Extraction set (folder): "Human Extraction"
>Observations (lines) searched: 127,835
>Hits: 1,631 (chunks found)
>Misses: 360 (chunks not found)
>Total unique chunks: 1,991
>Fuzzy hits: not flagged for extractions from consolidated files
>Records extracted: 1,562 (records where chunks were found)
>Times data file read: 43
>Average read size: 1.03 MB
>Average lines per read: 2,973
>Total bytes read: 44.34 MB
>Extraction time: about 1.3 minutes
>
>Regards,
>
>Gregory Lypny
>__________________________
>Associate Professor of Finance
>John Molson School of Business
>Concordia University
>Montreal, Canada
>
>_______________________________________________
>use-revolution mailing list
>use-revolution at lists.runrev.com
>http://lists.runrev.com/mailman/listinfo/use-revolution



