Many Cards Versus One Card and a List Field
Gregory Lypny
gregory.lypny at videotron.ca
Tue Jan 15 21:14:42 EST 2008
Hello Randall,
It's been a while. I could dig it up, but it would be embarrassing
(although the program is still being used). My scripting is like
Johnny Cash's guitar playing: primitive.
Here's the gist of it, although I'm sure (actually certain) that
there's nothing here that people on the list don't already know. But
let me know if I can clarify anything.
Regards,
Gregory Lypny
Associate Professor of Finance
John Molson School of Business
Concordia University
Montreal, Canada
- The raw sequence data is available in text files of 500 MB plus;
later projects involved files of more than a GB. These are what I
call flat-file databases in that each record can have a varying
number of sub-records (none to thousands) pertaining to other
variables. Records follow each other in a long stream and are
delimited by characters such as ">>". Variables within records are
labelled at the beginning of each line in capitals followed by a
colon, and after that any line may contain a further breakdown, which
can be delimited any number of ways. Basically a mess, and the files
aren't very useful in themselves for aggregating or doing batch
searches where there would be many hits. I think they were originally
set up (by the NIH? I forget.) to return results for single web-based
queries, kind of like finding a card in HyperCard. I don't think it
was foreseen that some researchers might want to submit a batch of
queries (called probe sets) to find whether they had been sequenced.
A made-up sample record follows.
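For concreteness, a record in that style might look something like
this (a sample invented for illustration, not the real layout; the
labels and sub-delimiters varied from variable to variable):

  >>
  LOCUS: AB012345
  DEFINITION: Mus musculus mRNA, clone 17-B
  FEATURES: source 1..1840; gene 112..1536
  SEQUENCE:
  atggcgtcaa ctgcattgga tccgaagctt
  >>
  LOCUS: AB012346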
- Having figured out the record delimiter, it was a question of
reading in the file a bit at a time to index the information about
each variable, that is, break it down and store it in separate files
for searching and extracting later. I experimented with reading in as
many complete records as possible (as opposed to lines or characters)
so as never to mistakenly cut off or lose part of a record, subject to
the constraint that the amount read into what I knew would be the
biggest MetaCard variables (what I called indexes) did not exceed a
certain size in MB, because MetaCard would slow down dramatically
after a certain point and, of course, go as slow as molasses if
virtual memory was called upon. I don't have the stats handy, but
trial and error in setting the reading criteria really pays off. The
loop below sketches the idea.
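Roughly, and from memory rather than the original scripts, the
reading loop amounted to something like this (the 8 MB slab size and
the indexRecords handler are made up for illustration; the right slab
size comes from trial and error):

  on indexBigFile pPath
    open file pPath for read
    repeat
      read from file pPath for 8388608    -- grab a slab of about 8 MB
      put it into tChunk
      if tChunk is empty then exit repeat -- nothing left to read
      read from file pPath until ">>"     -- top the slab up to a record boundary
      put it after tChunk
      indexRecords tChunk                 -- hand off whole records for indexing
    end repeat
    close file pPath
  end indexBigFile

Reading a big slab and then reading on until the next ">>" means
every chunk handed to the indexer ends on a record boundary, so
nothing is ever cut in half.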
- The MetaCard variables that were used as indexes to store the
various record variables were always simple tab-delimited lists, or
arrays that would later be converted to lists. These were always
built with repeat-for-each-line loops because that type of loop is
very fast. Another thing that sped up indexing was dumping the
contents of the indexes to text files intermittently to free up
memory. Again, trial and error was necessary (for me, anyway) to
determine the optimal number of times to write to disk, because
writing takes time, so you're balancing write time against variable
size. On an ancient Mac, the one-time indexing process could take
about 40 minutes. The handler below gives the flavour.
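In spirit, the indexing was no more than this (the colon split, the
20 MB flush threshold, and the file name are all made up for
illustration):

  local sIndex  -- accumulates tab-delimited index lines between flushes

  on indexRecords pRecords
    set the itemDelimiter to ":"
    repeat for each line tLine in pRecords  -- the fast form of repeat
      if ":" is in tLine then
        -- "LOCUS: AB012345" becomes "LOCUS", tab, "AB012345"
        put item 1 of tLine & tab & \
            word 1 to -1 of (item 2 to -1 of tLine) & return after sIndex
      end if
    end repeat
    if the length of sIndex > 20000000 then flushIndex  -- roughly 20 MB
  end indexRecords

  on flushIndex
    open file "variable_index.txt" for append
    write sIndex to file "variable_index.txt"
    close file "variable_index.txt"
    put empty into sIndex  -- free the memory
  end flushIndex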
- The next step was to build a simple interface so that searches and
extraction of hits could be done. The user imports a list of search
terms (such as probe sets); the appropriate index files are read in
and their lines are matched against the submitted queries. Output
files that can be slapped into a spreadsheet are created, and stats
are reported. A batch query of about 500 probe sets across the
100,000-plus DNA sequences used to take about three to seven minutes
on an old blue iMac (forgot what those are called). The matching is
sketched below.
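Stripped to its gist, and assuming each index line begins with the
term to be matched, the batch search was along these lines (handler
and file names invented for the example; the real thing reported more
stats):

  on runBatchQuery pQueryFile, pIndexFile, pOutFile
    put URL ("file:" & pIndexFile) into tIndex  -- read the whole index in
    set the itemDelimiter to tab
    repeat for each line tLine in tIndex        -- key each line by its first item
      put tLine & return after tLookup[item 1 of tLine]
    end repeat
    put URL ("file:" & pQueryFile) into tQueries
    put 0 into tHits
    repeat for each line tQuery in tQueries
      if tQuery is among the keys of tLookup then
        put tLookup[tQuery] after tOutput
        add 1 to tHits
      end if
    end repeat
    put tOutput into URL ("file:" & pOutFile)   -- tab-delimited output
    answer tHits && "hits for" && \
        the number of lines of tQueries && "queries."
  end runBatchQuery

Keying an array on the first item of each index line makes each of
the 500 or so queries a constant-time lookup rather than a scan of
100,000-plus lines.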
On Tue, Jan 15, 2008, at 10:47 AM, use-revolution-request at lists.runrev.com
wrote:
> Gregory, do you have a more detailed study of the architecture of
> your DNA data solution that you would be willing to share... How,
> physically, you're storing and manipulating and reporting your data?
>
> randall