Chunks vs Arrays - surprising benchmarking results
Richard Gaskin
ambassador at fourthworld.com
Wed Aug 5 15:05:26 EDT 2009
The "Multi dimensional array filtering" thread reminded me of a
benchmarking test I've been wanting to do for some time, and since I
have some tasks coming up on a client project that need this sort of
thing, it seemed a good time to dive in.
The goal is simple enough: one of the most common tasks I need to
perform with my data is querying specific fields for criteria, and if
there's a match then assembling the data from a given set of fields for
display in a list to the user.
I've been using simple tab-delimited lists for this data because
they're about as compact as the data can get and they perform
reasonably well. But with multi-dimensional arrays, the question is
whether Rev's fast hash index into array data might let me gather the
data from specific fields in each record faster than chunk expressions
can.
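For reference, here are the two access styles side by side (tData
holds the tab-delimited list and tDataA the equivalent two-dimensional
array; the line and item numbers are just illustrative):

```
set the itemDel to tab
-- chunk expression: the engine scans the text, counting delimiters
put item 3 of line 5 of tData into tValue
-- array notation: the engine follows its hash index to the element
put tDataA[5][3] into tValue
```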
So I took a minute to put together a simple test stack:
<http://fourthworldlabs.com/rev/speed%20test%20chunks%20vs%20arrays.rev.zip>
It has a field containing a list of contact info, another field for
displaying the test results, and a third for displaying the gathered
data from the query so I can verify that it's doing what I want it to.
If you download the stack you'll see that in addition to the "Test"
button there's another one there which converts the list data into an
array and stores that in a custom property of the field, needed for
testing the array method.
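In case you don't download the stack: the conversion is nothing fancy,
roughly like this (a sketch, not the actual button script):

```
on mouseUp
   set the itemDel to tab
   put 0 into tRecordNum
   repeat for each line tLine in fld 1
      add 1 to tRecordNum
      -- each item becomes a numbered element in the record's sub-array
      repeat with i = 1 to the number of items of tLine
         put item i of tLine into tDataA[tRecordNum][i]
      end repeat
   end repeat
   set the uDataA of fld 1 to tDataA
end mouseUp
```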
The code for the "Test" button is below, and I'd appreciate it if
anyone here with the time could look it over and see what I may have
missed, because the results I'm getting are not what I expected.
The test script is typical of many of the tasks I need to perform on
this data: it checks one field to see if it contains a value, checks
another to see if it contains a different value, and if both are true it
collects data from three fields into a tab- and return-delimited list so
I can drop it into a list field to display the output.
I had assumed that using chunk expressions to access items in each line
would be slower than using array notation to get them through the hash
in the array. But instead here's the result I'm getting (times are in
milliseconds for 100 iterations):
GetFromList: 72
GetFromSubArray: 752
GetFromMainArray: 407
All results the same?: true
As noted in the code below, the GetFromList handler uses simple chunk
expressions to parse the data; GetFromSubArray uses "repeat for each
element" to parse out the second-tier array within each record;
GetFromMainArray walks through the keys to get the data from the main
array by addressing both dimensions; the last line simply lets me know
that all three are returning the same result.
I can understand why GetFromSubArray is the slowest, since it has to
instantiate an array for the second-tier array each time through the
loop (using "repeat for each element...").
Still, I had hoped that accessing the main array by specifying the
elements in both dimensions would get to the data more quickly than
asking the engine to count items, but apparently not.
Of course there is a scaling issue with chunk expressions. In my sample
data there are only eight items in each record, but if there were
several hundred I would imagine it wouldn't perform as well as the array
methods. But in my case most of the data I work with has fewer than 30
fields and since chunk expressions are measuring about five times faster
I would expect I'd need many more than that before chunk expressions
drop below arrays in relative performance.
The same could be said of the size of the data within each item, since
that will adversely affect the time the engine needs to walk through it
looking for item delimiters. But again, it's not often that I have
field data that's very long (the contact list is a good example, in
which the longest field data is under 200 chars), and the engine's
seeking of delimiters seems reasonably efficient.
Another minor drawback to arrays for this sort of thing is what they
do to the size of the data, especially if you use meaningful names for
your fields rather than just numbers: while simple tab-delimited data
needs only one set of field names and a function to translate those
into item numbers before dropping into any loop that uses them, arrays
replicate the field names for each record. The longer the field names,
the more impact this has on data size.
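The translation function I have in mind is only a few lines; something
like this (a sketch, assuming the field names live in a tab-delimited
header line):

```
function FieldNum pFieldName, pHeaderLine
   set the itemDel to tab
   repeat with i = 1 to the number of items of pHeaderLine
      if item i of pHeaderLine is pFieldName then return i
   end repeat
   return 0 -- name not found
end FieldNum

-- e.g., put FieldNum("Email", line 1 of tData) into tEmailCol
-- then use "item tEmailCol of tLine" inside the loop
```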
I'd be happy to accept this as a trade-off if the speed justified it,
but in my test the speed benefit just isn't there for this type of
querying task.
Any thoughts on how I might optimize the array functions? Did I just
overlook something obvious which would make them run faster?
--
Richard Gaskin
Fourth World
Revolution training and consulting: http://www.fourthworld.com
Webzine for Rev developers: http://www.revjournal.com
------ code for Test button -------------
on mouseUp
   put empty into fld "Results"
   put empty into fld "Output"
   wait 0 with messages -- so clearing fields is visible
   --
   -- Number of iterations:
   put 100 into n
   --
   -- Test simple chunk expressions:
   put the millisecs into t
   repeat n
      put GetFromList() into tResult1
   end repeat
   put the millisecs - t into t1
   --
   -- Test sub-array method:
   put the millisecs into t
   repeat n
      put GetFromSubArray() into tResult2
   end repeat
   put the millisecs - t into t2
   --
   -- Test main array method:
   put the millisecs into t
   repeat n
      put GetFromMainArray() into tResult3
   end repeat
   put the millisecs - t into t3
   --
   -- Display results:
   put tResult2 into fld "Output"
   put "GetFromList: " & t1 & cr \
         & "GetFromSubArray: " & t2 & cr \
         & "GetFromMainArray: " & t3 & cr \
         & "All results the same?: " & \
         ((tResult1 = tResult2) AND (tResult2 = tResult3)) \
         into fld "Results"
end mouseUp
--
-- GetFromList
-- Uses simple item and line chunk expressions to access specific data
-- elements.
--
function GetFromList
   put empty into tList
   set the itemDel to tab
   put fld 1 into tData
   repeat for each line tLine in tData
      if item 1 of tLine contains "A" and item 4 of tLine contains "4" then
         put item 1 of tLine & tab & item 3 of tLine & tab & item 5 of tLine \
               & cr after tList
      end if
   end repeat
   delete last char of tList -- trailing CR
   sort lines of tList
   return tList
end GetFromList
--
-- GetFromSubArray
-- Walks through each element in the two-dimensional array tDataA
-- by putting each into a sub-array tRecordA, then accesses specific
-- elements of tRecordA to get specific field data.
--
function GetFromSubArray
   put empty into tList
   put the uDataA of fld 1 into tDataA
   repeat for each element tRecordA in tDataA
      if tRecordA[1] contains "A" and tRecordA[4] contains "4" then
         put tRecordA[1] & tab & tRecordA[3] & tab & tRecordA[5] & \
               cr after tList
      end if
   end repeat
   delete last char of tList -- trailing CR
   sort lines of tList
   return tList
end GetFromSubArray
--
-- GetFromMainArray
-- First obtains a list of keys of the main array, and uses these
-- to walk through it, accessing specific elements by that main
-- key and a sub-key.
--
function GetFromMainArray
   put empty into tList
   put the uDataA of fld 1 into tDataA
   repeat for each key tKey in tDataA
      if tDataA[tKey][1] contains "A" and tDataA[tKey][4] contains "4" then
         put tDataA[tKey][1] & tab & tDataA[tKey][3] & tab & \
               tDataA[tKey][5] & cr after tList
      end if
   end repeat
   delete last char of tList -- trailing CR
   sort lines of tList
   return tList
end GetFromMainArray
###