[RevServer tips] Spreading the load or why wise developers use asynchronous workflows
ambassador at fourthworld.com
Wed Aug 4 14:28:44 CDT 2010
wayne durden wrote:
> This is all very interesting to me because I am interested in moving a
> desktop app that processes datafiles up to 100,000 lines which can mean for
> each line comparing against the remainder (in reality sorts cut this down a
> great deal), but this can run for minutes on a desktop app and I have got to
> cut it down into asynchronous processing as per your article...
I don't know the specifics of your data or your needs, but lately I've
been experimenting with a variety of different ways to store data, and
I've found that for many tasks using column-based storage over row-based
storage can speed up searches and comparisons by orders of magnitude.
This is where the old acronyms OLAP and OLTP come in, with the "A" being
"analytical" (analytics, data mining; mostly read operations) and "T" being
"transaction" (posting as well as reading). That's an
oversimplification, but spending some time following the links out from
their Wikipedia entries can lead to all sorts of different ways to store
and index data for task-specific needs which can radically reduce CPU
and RAM consumption.
For example, if you had a data set in which you had 300,000 address
records stored in eight fields, you could store them in eight files in
which each stores only the values for a given column. Finding addresses
in a given zip code would then no longer need to traverse the whole data set and
parse each line, but merely pick up the one file for zip codes and
"repeat for each" with those. Any columns you're not interested in for
a given search are left on disk and take up zero RAM.
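As a rough sketch of that column-per-file layout (written in Python rather than Rev purely so it can be run anywhere; the file names, helper names, and sample records are all made up for illustration, and values are assumed to contain no line breaks):

```python
import os
import tempfile

def split_into_columns(rows, column_names, out_dir):
    """Write one file per column; line N of every file belongs to record N."""
    paths = {}
    for index, name in enumerate(column_names):
        path = os.path.join(out_dir, name + ".txt")
        with open(path, "w", encoding="utf-8", newline="\n") as handle:
            for row in rows:
                handle.write(row[index] + "\n")
        paths[name] = path
    return paths

def find_record_numbers(column_path, wanted):
    """Scan a single column file and return matching record numbers (1-based)."""
    matches = []
    with open(column_path, encoding="utf-8") as handle:
        for record_number, line in enumerate(handle, start=1):
            if line.rstrip("\n") == wanted:
                matches.append(record_number)
    return matches

# Hypothetical sample data: three records, showing three of the columns.
sample_rows = [
    ["Ann Lee", "Los Angeles", "90031"],
    ["Bob Ray", "Chicago", "60614"],
    ["Cy Dunn", "Los Angeles", "90031"],
]
work_dir = tempfile.mkdtemp()
column_paths = split_into_columns(sample_rows, ["name", "city", "zip"], work_dir)

# Only the zip file is opened; the name and city files stay on disk.
zip_matches = find_record_numbers(column_paths["zip"], "90031")
print(zip_matches)  # [1, 3]
```

The search touches only the zip column file, which is the whole point: the other columns are never read.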
Then there are other things one can add in, like cardinal indexing of
column values for one-step searches across data sets of any size.
Quick example using the zip code exercise again: You write an indexer
that runs through the data set and produces a stack file in which each
of the custom property keys of the stack is a zip code, and the value of
each property is a list of the ID numbers of all the records that have
that zip code.
With that index you can now search in one step:
get the uZipCodes["90031"] of stack "ZipIndex.rev"
...and you have an instant list of the ID of every record with that zip
code.
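To make the indexer idea concrete, here is a minimal sketch in Python rather than Rev, with a plain dictionary standing in for the stack's custom property set. The function name, record IDs, and sample data are assumptions made up for the example:

```python
from collections import defaultdict

def build_zip_index(records, zip_field):
    """Map each zip code to the list of record IDs that carry it."""
    index = defaultdict(list)
    for record_id, fields in records:
        index[fields[zip_field]].append(record_id)
    return dict(index)

# Hypothetical records: (ID, [name, zip]).
sample_records = [
    (101, ["Ann Lee", "90031"]),
    (102, ["Bob Ray", "60614"]),
    (103, ["Cy Dunn", "90031"]),
]

# The indexing pass runs once over the whole data set...
zip_index = build_zip_index(sample_records, zip_field=1)

# ...after which each search is a single keyed lookup.
ids_in_90031 = zip_index.get("90031", [])
print(ids_in_90031)  # [101, 103]
```

The cost moves from search time to index-build time, which pays off when the same data is searched many times.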
How to get the data once you've found those IDs?
There are an infinite number of ways to store data, but if you used even
just simple tab-delimited files you'd be surprised how quickly you can
get to what you want using the seek command if you write an index first.
Such a master index could also be a simple list of properties in a stack
(by far the most efficient way to load persistent arrays in Rev, much
faster than arrayDecode), in which each element key is the ID number of
the record and each value is just two lines: the byte offset to the
start of the record, and the length of the record.
With that relatively small index you can get any record anywhere in even
a giant file in four lines:
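open file tMyDataFile for read  -- tMyDataFile holds the path to the data file
seek to tRecordStart in file tMyDataFile  -- byte offset taken from the index
read from file tMyDataFile for tRecordLength  -- the record is now in "it"
close file tMyDataFile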
On my slow Mac here I can use that to pull a record out of a 500 MB file
containing 300,000 records in about 50 MICROseconds.
Since an index for a file like that will take only a few MBs it can be
loaded in no time, and the seek command doesn't load the whole data file
into RAM so the only memory consumption for getting the record is just
the record itself + the index + the engine's normal overhead.
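The master index and the seek-based fetch can be sketched end to end as follows. This is Python rather than Rev, and it assumes a tab-delimited file with one record per line and the record ID in the first field; the file name, function names, and sample data are hypothetical:

```python
import os
import tempfile

def build_offset_index(data_path):
    """Map record ID -> (byte offset, byte length) for each line of the file."""
    index = {}
    offset = 0
    with open(data_path, "rb") as handle:
        for line in handle:
            record_id = line.split(b"\t", 1)[0].decode("utf-8")
            index[record_id] = (offset, len(line))
            offset += len(line)
    return index

def fetch_record(data_path, index, record_id):
    """Seek straight to one record; the rest of the file is never loaded."""
    start, length = index[record_id]
    with open(data_path, "rb") as handle:
        handle.seek(start)
        return handle.read(length).decode("utf-8").rstrip("\n")

# Hypothetical sample file with three records.
work_dir = tempfile.mkdtemp()
data_path = os.path.join(work_dir, "addresses.tsv")
with open(data_path, "wb") as handle:
    handle.write(b"101\tAnn Lee\t90031\n")
    handle.write(b"102\tBob Ray\t60614\n")
    handle.write(b"103\tCy Dunn\t90031\n")

offset_index = build_offset_index(data_path)
record = fetch_record(data_path, offset_index, "102")
print(record)
```

Opening the file in binary mode keeps the stored offsets in bytes, so they stay valid regardless of character encoding.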
Combine that with the cardinal indexing described above and you can slice
and dice data any number of ways really quickly.
Of course this is only suited for OLAP-style tasks, dependent on the
data not changing frequently so it can be worthwhile indexing it without
the indexing adding more overhead than it's worth. FWIW, on my slow Mac
I can write the master index and two or three columnar cardinal indices
in well under a minute.
For all sorts of tasks in which data is read far more frequently than
written, you can use methods like this to get ultra-fast results with
minimal resource consumption.
If the data on the server is not modified there but merely used as a
data repository for your searches, you could do the indexing tasks on
your desktop and just upload the index stacks to your server along with
a copy of the file. The server load will always be minimal, and you can
do some relatively massive tasks while staying well under the limits of
even most shared hosting plans.
Of course you could also use MySQL, CouchDB, or any number of other
off-the-shelf solutions for much of this, but for some tasks you may
find you can write an indexer and retriever faster in Rev than you could
dig up the syntax to do it in another language. :)
WARNING: Once you start exploring indexing techniques you may become
addicted; you will find yourself daydreaming about new methods at odd
hours of the day, and time formerly spent with the family will suddenly
become spent on the web learning even better methods. You may find
yourself thinking about ways to use Rev's union and intersect commands
on results from index searches to implement even complex AND and OR
queries in one step. Turning data inside out can cause your mind to
cave in on itself, and worse, you may like it. You have been warned.
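For the curious, the AND/OR idea looks roughly like this. The sketch uses Python sets as a stand-in for Rev's union and intersect commands (which work on arrays), and the two indices and their contents are invented for the example:

```python
# Hypothetical indices, each mapping a column value to a list of record IDs.
zip_index = {"90031": [101, 103, 107], "60614": [102]}
city_index = {"Los Angeles": [101, 103, 105], "Chicago": [102]}

def ids_for(index, key):
    """Return the record IDs for one index value as a set (empty if absent)."""
    return set(index.get(key, []))

# AND: records in zip 90031 that are also in Los Angeles (set intersection).
and_result = sorted(ids_for(zip_index, "90031") & ids_for(city_index, "Los Angeles"))

# OR: records in either zip code (set union).
or_result = sorted(ids_for(zip_index, "90031") | ids_for(zip_index, "60614"))

print(and_result)  # [101, 103]
print(or_result)   # [101, 102, 103, 107]
```

Neither query touches the data file at all; only the ID lists from the indices are combined.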
Rev training and consulting: http://www.fourthworld.com
Webzine for Rev developers: http://www.revjournal.com
revJournal blog: http://revjournal.com/blog.irv