[RevServer tips] Spreading the load or why wise developers use asynchronous workflows

Richard Gaskin ambassador at fourthworld.com
Wed Aug 4 15:28:44 EDT 2010


wayne durden wrote:

> This is all very interesting to me because I am interested in moving a
> desktop app that processes datafiles up to 100,000 lines which can mean for
> each line comparing against the remainder (in reality sorts cut this down a
> great deal), but this can run for minutes on a desktop app and I have got to
> cut it down into asynchronous processing as per your article...

I don't know the specifics of your data or your needs, but lately I've 
been experimenting with a variety of different ways to store data, and 
I've found that for many tasks using column-based storage over row-based 
storage can speed up searches and comparisons by orders of magnitude.

This is where the old acronyms OLAP and OLTP come in, with the "A" being 
"analytical" (analytics, data mining; mostly read operations) and "T" 
being "transaction" (posting as well as reading).  That's an 
oversimplification, but spending some time following those topics out on 
Wikipedia can lead to all sorts of different ways to store and index 
data for task-specific needs which can radically reduce CPU and RAM 
consumption.

For example, if you had a data set in which you had 300,000 address 
records stored in eight fields, you could store them in eight files in 
which each stores only the values for a given column.  Finding addresses 
by zip code would then no longer need to traverse the whole data set and
parse each line, but merely pick up the one file for zip codes and 
"repeat for each" with those.  Any columns you're not interested in for 
a given search are left on disk and take up zero RAM.
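
In script that search might look something like this (an untested 
sketch, assuming a column file "zip.txt" with one zip code per line, 
where a record's ID is simply its line number):

    -- load just the zip column; the other seven columns never touch RAM
    put URL "file:zip.txt" into tZips
    put 0 into tLineNum
    repeat for each line tZip in tZips
       add 1 to tLineNum
       if tZip is "90031" then put tLineNum & return after tMatches
    end repeat
    -- tMatches now lists the ID of every record in that zip code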

Then there are other things one can add in, like cardinal indexing of 
column values for one-step searches across data sets of any size.

Quick example using the zip code exercise again:  You write an indexer 
that runs through the data set and produces a stack file in which each 
of the custom property keys of the stack is a zip code, and the value of 
each property is a list of the ID numbers of all the records that have 
that zip code.
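
An indexer like that might look something like this (untested sketch; it 
assumes a tab-delimited file "records.txt" with the record ID in item 1 
and the zip code in item 2, and a stack "ZipIndex.rev" already saved to 
disk):

    on BuildZipIndex
       put URL "file:records.txt" into tData
       set the itemDelimiter to tab
       -- gather the record IDs for each zip code into an array first
       repeat for each line tRecord in tData
          put item 1 of tRecord & return after tIndex[item 2 of tRecord]
       end repeat
       -- then write each zip's ID list into a custom property of the stack
       repeat for each key tZip in tIndex
          set the uZipCodes[tZip] of stack "ZipIndex.rev" to tIndex[tZip]
       end repeat
       save stack "ZipIndex.rev"
    end BuildZipIndex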

With that index you can now search in one step:

   get the uZipCodes["90031"] of stack "ZipIndex.rev"

...and you have an instant list of the ID of every record with that zip 
code.

How to get the data once you've found those IDs?

There are an infinite number of ways to store data, but if you used even 
just simple tab-delimited files you'd be surprised how quickly you can 
get to what you want using the seek command if you write an index first.

Such a master index could also be a simple list of properties in a stack 
(by far the most efficient way to load persistent arrays in Rev, much 
faster than arrayDecode), in which each element key is the ID number of 
the record and each value is just two lines:  the byte offset to the 
start of the record, and the length of the record.
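
Building that master index might look something like this (again an 
untested sketch, assuming the same tab-delimited "records.txt" with the 
ID in item 1, Unix line endings, and a stack "MasterIndex.rev" to hold 
the properties):

    on BuildMasterIndex
       put URL "binfile:records.txt" into tData
       set the itemDelimiter to tab
       put 0 into tOffset
       repeat for each line tRecord in tData
          -- each property holds two lines: byte offset, then length
          set the uRecords[item 1 of tRecord] of stack "MasterIndex.rev" \
                to tOffset & return & length(tRecord)
          add length(tRecord) + 1 to tOffset  -- +1 for the line ending
       end repeat
       save stack "MasterIndex.rev"
    end BuildMasterIndex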

With that relatively small index you can get any record anywhere in even 
a giant file in four lines:

    open file tMyDataFile for read
    seek to tRecordStart in file tMyDataFile
    read from file tMyDataFile for tRecordLength
    close file tMyDataFile
    -- the record's text is now in the special variable "it"

On my slow Mac here I can use that to pull a record out of a 500 MB file 
containing 300,000 records in about 50 MICROseconds.

Since an index for a file like that will take only a few MBs it can be 
loaded in no time, and the seek command doesn't load the whole data file 
into RAM so the only memory consumption for getting the record is just 
the record itself + the index + the engine's normal overhead.

Combine that with the cardinal indexing described above and you can 
slice and dice data any number of ways really quickly.
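
Putting the two indexes together might look something like this (same 
assumed file and stack names as the sketches above):

    put the uZipCodes["90031"] of stack "ZipIndex.rev" into tIDs
    open file "records.txt" for read
    repeat for each line tID in tIDs
       put the uRecords[tID] of stack "MasterIndex.rev" into tEntry
       seek to (line 1 of tEntry) in file "records.txt"
       read from file "records.txt" for (line 2 of tEntry)
       put it & return after tResults
    end repeat
    close file "records.txt"
    -- tResults now holds the full record of every address in that zip code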

Of course this is only suited for OLAP-style tasks: it depends on the 
data not changing frequently, so that indexing it is worthwhile and the 
indexing doesn't add more overhead than it saves.  FWIW, on my slow Mac 
I can write the master index and two or three columnar cardinal indices 
in well under a minute.

For all sorts of tasks in which data is read far more frequently than
written, you can use methods like this to get ultra-fast results with 
minimal resource consumption.

If the data on the server is not modified there but merely used as a 
data repository for your searches, you could do the indexing tasks on 
your desktop and just upload the index stacks to your server along with 
a copy of the file.  The server load will always be minimal, and you can 
do some relatively massive tasks while staying well within even most 
shared hosting limits.

Of course you could also use MySQL, CouchDB, or any number of other 
off-the-shelf solutions for much of this, but for some tasks you may 
find you can write an indexer and retriever faster in Rev than you could 
dig up the syntax to do it in another language. :)


WARNING:  Once you start exploring indexing techniques you may become 
addicted; you will find yourself daydreaming about new methods at odd 
hours of the day, and time formerly spent with the family will suddenly 
become spent on the web learning even better methods.  You may find 
yourself thinking about ways to use Rev's union and intersect commands 
on results from index searches to implement even complex AND and OR 
queries in one step.  Turning data inside out can cause your mind to 
cave in on itself, and worse, you may like it.  You have been warned.
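
For instance, ANDing two index lookups might look something like this 
rough sketch (the second index stack, "StateIndex.rev" with its uStates 
property, is purely hypothetical):

    put the uZipCodes["90031"] of stack "ZipIndex.rev" into tListA
    put the uStates["CA"] of stack "StateIndex.rev" into tListB
    -- load each ID list into array keys, then keep only the keys in both
    repeat for each line tID in tListA
       put true into tSetA[tID]
    end repeat
    repeat for each line tID in tListB
       put true into tSetB[tID]
    end repeat
    intersect tSetA with tSetB
    put the keys of tSetA into tMatchingIDs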

--
  Richard Gaskin
  Fourth World
  Rev training and consulting: http://www.fourthworld.com
  Webzine for Rev developers: http://www.revjournal.com
  revJournal blog: http://revjournal.com/blog.irv


