Chunks vs Arrays - surprising benchmarking results

Richard Gaskin ambassador at fourthworld.com
Wed Aug 5 15:05:26 EDT 2009


The "Multi dimensional array filtering" thread reminded me of a 
benchmarking test I've been wanting to do for some time, and since I 
have some tasks coming up on a client project that need this sort of 
thing, it was a good time to dive in.

The goal is simple enough:  one of the most common tasks I need to 
perform with my data is querying specific fields against criteria and, 
when there's a match, assembling the data from a given set of fields 
for display in a list to the user.

I've been using simple tab-delimited lists for this data because 
they're about as compact as the data can get and they perform 
reasonably well.  But with multi-dimensional arrays, the question is 
whether Rev's fast hash index into array data might help me gather the 
data from specific fields in each record faster than using chunk 
expressions.

So I took a minute to put together a simple test stack:
<http://fourthworldlabs.com/rev/speed%20test%20chunks%20vs%20arrays.rev.zip>

It has a field containing a list of contact info, another field for 
displaying the test results, and a third for displaying the gathered 
data from the query so I can verify that it's doing what I want it to.

If you download the stack you'll see that in addition to the "Test" 
button there's a second button which converts the list data into an 
array and stores it in a custom property of the field; that's needed 
for testing the array methods.

The code for the "Test" button is below, and I would appreciate anyone 
here who has the time to look it over and see what I may have missed, 
because the results I'm getting are not what I expected.

The test script is typical of many of the tasks I need to perform on 
this data:  it checks one field to see if it contains a value, checks 
another to see if it contains a different value, and if both are true it 
collects data from three fields into a tab- and return-delimited list so 
I can drop it into a list field to display the output.
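To make the query logic concrete, here's the same test sketched in 
Python with invented contact data -- one function splitting each line 
on demand (the chunk approach) and one querying a pre-built nested 
dict (the array approach).  One caveat: Python's "in" is 
case-sensitive where Rev's "contains" is not, but both sketches 
behave the same way, which is the point:

```python
# The same query two ways: chunk-style (split each line on demand) and
# array-style (query a pre-built nested dict). The data here is invented.
# Note: Python's "in" is case-sensitive; Rev's "contains" is not.
DATA = ("Adams\tx\t555-1111\t1465\tOH\n"
        "Baker\ty\t555-2222\t99\tCA\n"
        "Avery\tz\t555-3333\t7\tTX")

def get_from_list(data):
    out = []
    for line in data.split("\n"):
        items = line.split("\t")
        # item 1 contains "A" and item 4 contains "4"
        if "A" in items[0] and "4" in items[3]:
            out.append("\t".join([items[0], items[2], items[4]]))
    return sorted(out)

def get_from_array(data):
    # build the nested structure once, then query it by key
    array = {i: {j: v for j, v in enumerate(line.split("\t"), 1)}
             for i, line in enumerate(data.split("\n"), 1)}
    out = []
    for key in array:
        rec = array[key]
        if "A" in rec[1] and "4" in rec[4]:
            out.append("\t".join([rec[1], rec[3], rec[5]]))
    return sorted(out)

print(get_from_list(DATA) == get_from_array(DATA))  # -> True
```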

I had assumed that using chunk expressions to access items in each line 
would be slower than using array notation to get them through the hash 
in the array.  But instead here's the result I'm getting (times are in 
milliseconds for 100 iterations):

GetFromList: 72
GetFromSubArray: 752
GetFromMainArray: 407
All results the same?: true

As noted in the code below, the GetFromList handler uses simple chunk 
expressions to parse the data; GetFromSubArray uses "repeat for each 
element" to parse out the second-tier array within each record; 
GetFromMainArray walks through the keys to get the data from the main 
array by addressing both dimensions; the last line simply lets me know 
that all three are returning the same result.

I can understand why GetFromSubArray is the slowest, since it has to 
instantiate a copy of the second-tier array each time through the loop 
(that's what "repeat for each element..." does).

But I had hoped that accessing the main array by specifying the 
elements in both dimensions would get to the data more quickly than 
asking the engine to count items does, but apparently not.

Of course there is a scaling issue with chunk expressions.  In my 
sample data there are only eight items in each record, but if there 
were several hundred I would imagine chunks wouldn't perform as well 
as the array methods.  But in my case most of the data I work with has 
fewer than 30 fields, and since chunk expressions are measuring about 
five times faster, I would expect I'd need many more fields than that 
before chunk expressions drop below arrays in relative performance.

The same could be said of the size of the data within each item, since 
that will adversely affect the time the engine needs to walk through it 
looking for item delimiters.  But again, it's not often that I have 
field data that's very long (the contact list is a good example, in 
which the longest field data is under 200 chars), and the engine's 
seeking of delimiters seems reasonably efficient.
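As a toy model of that seek cost: to fetch item k, something has to 
walk every character ahead of it, counting delimiters -- so access 
time grows with both item position and field length.  Roughly, in 
Python (a hypothetical helper, not how the engine actually does it):

```python
# Toy model of chunk access: fetching item k of a line means scanning
# every character before it, counting delimiters along the way.
def nth_item(line, k, delim="\t"):
    count, start = 1, 0
    for i, ch in enumerate(line):
        if count == k:
            end = line.find(delim, start)
            return line[start:] if end == -1 else line[start:end]
        if ch == delim:
            count += 1
            start = i + 1
    return line[start:] if count == k else ""

print(nth_item("a\tbb\tc", 2))  # -> bb
```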

Another minor drawback to arrays for this sort of thing is what they 
do to the size of the data, especially if you use meaningful names for 
your fields rather than just numbers:  while simple tab-delimited data 
needs only one set of field names and a function to translate those 
into item numbers before dropping into any loop that uses them, arrays 
replicate the field names for each record.  The longer the field 
names, the more impact this has on data size.
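That overhead is easy to quantify with a quick sketch (Python again, 
with invented field names and data):

```python
# Rough size comparison: field names stored once in a header line vs.
# repeated in every record. Names and data here are invented.
FIELDS = ["firstName", "lastName", "phone", "zip", "state"]
ROWS = [["Ann", "Adams", "555-1111", "44101", "OH"],
        ["Bob", "Baker", "555-2222", "90210", "CA"]]

# Tab-delimited: field names appear once, as a header.
flat = "\t".join(FIELDS) + "\n" + "\n".join("\t".join(r) for r in ROWS)

# Keyed storage: every record carries every field name again.
keyed = "\n".join(f"{name}\t{value}"
                  for row in ROWS
                  for name, value in zip(FIELDS, row))

print(len(flat), len(keyed))  # the keyed form is larger,
                              # and grows with field-name length
```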

I'd be happy to accept this as a trade-off if the speed justified it, 
but in my test the speed benefit just isn't there for this type of 
querying task.

Any thoughts on how I might optimize the array functions?  Did I just 
overlook something obvious which would make them run faster?

--
  Richard Gaskin
  Fourth World
  Revolution training and consulting: http://www.fourthworld.com
  Webzine for Rev developers: http://www.revjournal.com


------ code for Test button -------------

on mouseUp
   put empty into fld "Results"
   put empty into fld "Output"
   wait 0 with messages -- so clearing fields is visible
   --
   -- Number of iterations:
   put 100 into n
   --
   -- Test simple chunk expressions:
   put the millisecs into t
   repeat n
     put GetFromList() into tResult1
   end repeat
   put the millisecs - t into t1
   --
   -- Test sub-array method:
   put the millisecs into t
   repeat n
     put GetFromSubArray() into tResult2
   end repeat
   put the millisecs - t into t2
   --
   -- Test main array method:
   put the millisecs into t
   repeat n
     put GetFromMainArray() into tResult3
   end repeat
   put the millisecs - t into t3
   --
   -- Display results:
   put tResult2 into fld "Output"
   put "GetFromList: "&t1 &cr\
       &"GetFromSubArray: "&t2 &cr\
       &"GetFromMainArray: "& t3 &cr\
       & "All results the same?: "&\
          ((tResult1 = tResult2) AND (tResult2 = tResult3)) \
       into fld "Results"
end mouseUp


--
-- GetFromList
-- Uses simple item and line chunk expressions to access specific data
-- elements.
--
function GetFromList
   put empty into tList
   set the itemdel to tab
   put fld 1 into tData
   repeat for each line tLine in tData
     if item 1 of tLine contains "A" and item 4 of tLine contains "4" then
       put item 1 of tLine &tab& item 3 of tLine &tab& item 5 of tLine \
        &cr after tList
     end if
   end repeat
   delete last char of tList -- trailing CR
   sort lines of tList
   return tList
end GetFromList


--
-- GetFromSubArray
-- Walks through each element in the two-dimensional array tDataA
-- by putting each into a sub-array tRecordA, then accesses specific
-- elements of tRecordA to get specific field data.
--
function GetFromSubArray
   put empty into tList
   put the uDataA of fld 1 into tDataA
   repeat for each element tRecordA in tDataA
     if tRecordA[1] contains "A" and tRecordA[4] contains "4" then
       put tRecordA[1] &tab& tRecordA[3] &tab& tRecordA[5] &\
        cr after tList
     end if
   end repeat
   delete last char of tList -- trailing CR
   sort lines of tList
   return tList
end GetFromSubArray


--
-- GetFromMainArray
-- First obtains a list of keys of the main array, and uses these
-- to walk through it, accessing specific elements by that main
-- key and a sub-key.
--
function GetFromMainArray
   put empty into tList
   put the uDataA of fld 1 into tDataA
   repeat for each key tKey in tDataA
     if tDataA[tKey][1] contains "A" and tDataA[tKey][4] contains "4" then
       put tDataA[tKey][1] &tab& tDataA[tKey][3] &tab&\
         tDataA[tKey][5] &cr after tList
     end if
   end repeat
   delete last char of tList -- trailing CR
   sort lines of tList
   return tList
end GetFromMainArray

###


