Many Cards Versus One Card and a List Field

Gregory Lypny gregory.lypny at videotron.ca
Tue Jan 15 21:14:42 EST 2008


Hello Randall,

It's been a while.  I could dig it up, but it would be embarrassing  
(although the program is still being used).  My scripting is like  
Johnny Cash's guitar playing was: primitive.

Here's the gist of it, although I'm sure (actually certain) that  
there's nothing here that people on the list don't already know.  But  
let me know if I can clarify anything.

Regards,

Gregory Lypny

Associate Professor of Finance
John Molson School of Business
Concordia University
Montreal, Canada

- The raw sequence data is available in text files, which are 500 MB  
plus.  Later projects involved files of more than a GB.  These are  
what I call flat-file databases in that each record can have a varying  
number of sub-records (none to thousands) pertaining to other  
variables.  Records follow each other in a long stream and are  
delimited by characters such as ">>".  Variables within records are  
labelled at the beginning of each line in capitals followed by a  
colon.  After that, any line may contain a further breakdown, which  
can be delimited any number of ways.  Basically a mess, and the files  
aren't very useful in themselves for aggregating or doing batch  
searches where there would be many hits.  I think they were originally  
set up (by the NIH?  I forget.) to return results for single web-based  
queries, kind of like finding a card in HyperCard.  I don't think it  
was foreseen that some researchers might want to submit a batch of  
queries (called probe sets) to find whether they had been sequenced.
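
To give a rough (and entirely made-up) picture of the layout, a record  
might look something like the following; the labels and the breakdown  
within lines varied from file to file:

    >>
    ID: 1234_at
    SEQUENCE: ATGGCCATTGTAATGGGCCGC...
    SOURCE: chr7; strand=+; notes=...
    >>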

- Having figured out the record delimiter, it was a question of  
reading in the file a bit at a time to index the information about  
each variable, that is, break it down and store it in separate files  
for searching and extracting later.  I experimented with reading in as  
many complete records as possible (as opposed to lines or characters)  
so as never to mistakenly cut off or lose part of a record, subject to  
the constraint that the amount read into what I knew would be the  
biggest MetaCard variables (what I called indexes) did not exceed a  
certain size in MB, because MetaCard would slow down dramatically after  
a certain point and, of course, go as slow as molasses if virtual  
memory was called upon.  I don't have the stats handy, but trial and  
error in setting the reading criteria really pays off.
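
In Revolution/MetaCard terms, the reading loop amounted to something  
like the sketch below.  This is not my actual handler; the ">>"  
delimiter is as described above, but the handler name it calls, the  
paths, and the 20 MB cap are placeholders:

    on indexRawFile pRawPath
      local tBuffer
      open file pRawPath for read
      repeat
        -- read whole records at a time so none is ever cut in half
        read from file pRawPath until ">>"
        if the result is "eof" and it is empty then exit repeat
        put it after tBuffer
        -- keep the working variable below the size where things bog down
        if the length of tBuffer > 20000000 then
          buildSequenceIndex tBuffer  -- break the chunk into index files
          put empty into tBuffer
        end if
      end repeat
      if tBuffer is not empty then buildSequenceIndex tBuffer
      close file pRawPath
    end indexRawFile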

- The MetaCard variables that were used as indexes to store the  
various record variables were always simple tab-delimited lists or  
arrays that would be later converted to lists.  These were always  
created using Repeat-for-each-line loops because this type of loop is  
very fast.  Another thing that increased the speed of indexing was to  
dump the contents of the indexes to text files intermittently to free  
up memory.  Again, trial and error was necessary (for me, anyway) to  
determine the optimal number of times to write to disk because writing  
takes time, so you're balancing write time with variable size.  On an  
ancient Mac, the one-time indexing process could take about 40 minutes.
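
The indexing handler called in the sketch above might look roughly  
like this.  Again, the "SEQUENCE" label, the index file name, and the  
50,000-line dump interval are made up; the real script handled several  
variables at once:

    on buildSequenceIndex pBuffer
      local tIndex, tLineCount
      put 0 into tLineCount
      repeat for each line tLine in pBuffer
        -- repeat-for-each never re-scans the variable, which is why it is fast
        if char 1 to 9 of tLine is "SEQUENCE:" then
          put char 10 to -1 of tLine & return after tIndex
        end if
        add 1 to tLineCount
        if tLineCount mod 50000 = 0 then
          -- intermittent dump to disk to keep the variable (and memory) small
          open file "sequence_index.txt" for append
          write tIndex to file "sequence_index.txt"
          close file "sequence_index.txt"
          put empty into tIndex
        end if
      end repeat
      if tIndex is not empty then
        open file "sequence_index.txt" for append
        write tIndex to file "sequence_index.txt"
        close file "sequence_index.txt"
      end if
    end buildSequenceIndex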

- The next step was to build a simple interface so that searches and  
extraction of hits could be done.  The user imports a list of search  
terms (such as probe sets); the appropriate index files are read in  
and the lines matched with the submitted queries.  Output files that  
can be slapped into a spreadsheet are created and stats reported.  A  
batch query of about 500 probe sets across the 100,000 plus DNA  
sequences used to take about three to seven minutes on an old blue  
iMac (I forget what those were called).  
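
The matching step is nothing fancy either.  A rough outline, with the  
index layout (query ID in item 1 of each tab-delimited line) and the  
dialog at the end made up for illustration:

    on extractHits pQueryList, pIndexPath, pOutputPath
      local tIndex, tHits, tHitCount
      set the itemDelimiter to tab
      put url ("file:" & pIndexPath) into tIndex
      put 0 into tHitCount
      repeat for each line tLine in tIndex
        if item 1 of tLine is among the lines of pQueryList then
          put tLine & return after tHits
          add 1 to tHitCount
        end if
      end repeat
      -- tab-delimited output drops straight into a spreadsheet
      put tHits into url ("file:" & pOutputPath)
      answer tHitCount && "hits for" && the number of lines of pQueryList && "queries."
    end extractHits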





On Tue, Jan 15, 2008, at 10:47 AM, use-revolution-request at lists.runrev.com wrote:

> Gregory, do you have a more detailed study of the architecture of  
> your DNA data solution that you would be willing to share... How  
> physically you're storing and manipulating and reporting your data?
>
> randall



