Chunks vs Arrays - surprising benchmarking results

Richard Gaskin ambassador at fourthworld.com
Fri Aug 7 14:15:58 EDT 2009


Kevin Miller wrote:

> On 05/08/2009 21:02, "Richard Gaskin" <ambassador at fourthworld.com> wrote:
> 
>> Excellent sleuthing, Trevor.   Confirmed:  with that change I'm getting
>> the same results.  Who would have thought there could be so much
>> overhead moving a custom property array into a variable array?
> 
> Bear in mind that when retrieving a custom property, the engine has to look
> up whether there is a getProp handler for it. Locking messages should give
> you an improvement.

In my tests here the difference was measurable but minor; then again, 
I have no getProp handlers in the message path for these tests.
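
For anyone trying this at home, the locking pattern is as simple as 
this (the property and stack names are placeholders of mine):

   lock messages -- getProp/setProp handlers are skipped while locked
   put the uContacts of stack "MyData" into tData
   unlock messages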


> The main difference between arrays and chunks is that arrays will continue
> to access huge data sets in linear time, whereas chunks (depending on
> exactly what you're doing) will slow down as the data set gets larger.

Very true, though the difference becomes most significant only when the 
data is large enough that it may be a good time to consider SQLite over 
stack files for storage anyway. :)


In my tests I ran three sets of data:

small: 5,000 records, 50 items per record, 50 chars per item

medium: 10,000 records, 100 items per record, 100 chars per item

large: 10,000 records, 100 items per record, 200 chars per item

I put both the delimited and array data into the same stack for each 
test, giving me a small stack of about 27MB, a medium of about 204MB, 
and a large of over 408MB.
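
For anyone wanting to reproduce the small set, something along these 
lines will build it (not my exact script, just the shape of it; the 
names are arbitrary):

   -- build one 50-char item, one 50-item record, then 5000 records
   local tItem, tRecord, tData
   repeat 50 times
      put any char of "abcdefghijklmnopqrstuvwxyz" after tItem
   end repeat
   repeat 50 times
      put tItem & comma after tRecord
   end repeat
   delete last char of tRecord -- drop the trailing comma
   repeat 5000 times
      put tRecord & return after tData
   end repeat
   delete last char of tData -- drop the trailing return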

The small data set performed well with both methods, as did the 
medium, although the medium stack took some time to load (confirming 
my hunch that 100MB is probably a good boundary for considering 
SQLite over stack files; fortunately I'm using this only for document 
files, so it's unlikely I'll ever reach even half that).

The large data set could be created and saved, but the resulting 
stack could not be opened; there was no corruption warning, it just 
wouldn't open.  Have I discovered an undocumented limitation?

The results were as we would expect: as the data grows, the array 
search method continues to scale linearly, while chunk expressions 
degrade faster than linearly, since each chunk access must scan the 
data from the beginning.
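
To make the two access patterns concrete, the heart of each test loop 
looks roughly like this (the variable names are mine):

   -- chunk lookup: the engine scans the text from the top to find
   -- line i, so each access gets slower as tData grows
   put item 3 of line i of tData into tValue

   -- array lookup: hashed keys make the cost independent of size
   put tArray[i][3] into tValue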

Even so, chunk expressions consistently outperformed arrays in tests 
which included loading the data from properties.

When I altered the test to preload the data into vars before testing, 
arrays came out ahead by just under an order of magnitude.

While this means changing my setup to preload data when documents are 
first opened, that one-time hit is more than compensated for by the 
ongoing performance gains in nearly all other operations.
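
Roughly, the preload amounts to nothing more than this (handler 
placement and names are mine):

   global gDocData

   on preOpenStack
      -- one-time hit: pull the array prop into a variable up front
      put the uDocData of this stack into gDocData
      pass preOpenStack
   end preOpenStack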


So I'm strongly favoring arrays for this sort of thing, but it would be 
nice to have three enhancements in the engine to make it even better:

1. faster load time
-------------------
Can the operation that moves data from array custom props into array 
variables be optimized to reduce the surprising amount of time it 
takes?  Grabbing a non-array property is nearly as fast as accessing 
a global; it would be nice if accessing array props came at least a 
bit closer to that stellar performance.

2. operate directly on properties
---------------------------------
It would be very handy if we could use the same array syntax to work 
with properties as we can with variables.  Before multi-dimensional 
arrays there was an enjoyable, learnable, and efficient parity 
between the syntax for arrays in vars and in props, and I miss that 
when working with nested arrays.
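
To illustrate the parity I'm missing (the last line is the wished-for 
syntax, not something the current engine accepts):

   -- variables take nested keys directly:
   put "Smith" into tContacts[1]["Name"]
   -- a property with a single key also works:
   set the uLastName[1] of this stack to "Smith"
   -- but nested keys on a property don't parse today; this is
   -- what I'm wishing for:
   --   set the uContacts[1]["Name"] of this stack to "Smith"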


3. reduce data redundancy in keys
---------------------------------
Given that Rev's arrays are associative, every element is a 
name-value pair, so in addition to storing the value the engine must 
also store the name as its key.  This is necessary because, for all 
the engine knows, every array may contain unique keys; but when 
making nested arrays in which the inner arrays are all uniform, the 
replicated key names just take up space.
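
Concretely, with some made-up contact data:

   put "Jane Doe" into tContacts[1]["Name"]
   put "555-0100" into tContacts[1]["Telephone"]
   put "John Roe" into tContacts[2]["Name"]
   put "555-0199" into tContacts[2]["Telephone"]
   -- "Name" and "Telephone" are identical strings in every record,
   -- yet each record stores its own copy of them as keys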

For example, the Congress contact info I used in my original example 
is only 530 lines, with less than half a K per line, so tucking it 
into a property in a new stack gives a stack file of about 68k.

But when I make an array version of that data and store it in a 
property in another stack, using meaningful names for the elements 
(e.g., "Name", "Address", "Telephone", etc.) brings the stack size up 
to 132k, more than double.

So I was daydreaming:  what if we could tell the engine that a given 
array is to be populated only with sub-arrays in which all keys will 
always be the same?

Imagine being able to define something like a struct: a description 
of an array's keys which could be assigned to another array, so that 
the parent array could store the data more efficiently, tucking only 
the values into its nifty hash without having to replicate the field 
names for every element.

I would imagine such a struct-like thing would have a great many uses, 
in addition to reducing memory and storage requirements for uniform 
arrays as elements of a parent array.

Doable?  By Tuesday, perhaps? :)

--
  Richard Gaskin
  Fourth World
  Revolution training and consulting: http://www.fourthworld.com
  Webzine for Rev developers: http://www.revjournal.com


