Chunks vs Arrays - surprising benchmarking results
Paul Looney
support at ahsomme.com
Wed Aug 5 18:52:05 EDT 2009
Richard,
I have nothing to add directly to the chunk vs array discussion
(Trevor's reply was very good) but I have often found it helpful to
increase the speed of compound selections by breaking them into
individual ones.
For instance if you have a large database of names and sexes and you
want to select every female named "Jan" ("Jan" could be male or female).
Select all of the Jans first (this will run much faster than the
compound selection).
Then select all of the females from the result of the first selection
(this will run faster because it is searching only "Jan"s - a very
small list).
This double selection will run faster than a single compound selection.
Obviously this requires a known data-set where one filter will
eliminate a lot of records (selecting "female", then selecting "Jan"
would be much slower in our example because, presumably, half of the
list is female and a small portion is Jan).
On many lists this can create a much bigger speed difference than
chunk vs array variance you noted.
Paul Looney
On Aug 5, 2009, at 12:05 PM, Richard Gaskin wrote:
> The "Multi dimensional array filtering" thread reminded me of a
> benchmarking test I've been wanting to do for some time, and since
> I have some tasks coming up on a client project which needs this
> sort of stuff it was a good time to dive in.
>
> The goal is simple enough: one of the most common tasks I need to
> perform with my data is querying specific fields for criteria, and
> if there's a match then assembling the data from a given set of
> fields for display in a list to the user.
>
> I've been using simple tab-delimited lists for this data because it
> was about as compact as it could be and performs reasonably well.
> But with multi-dimensional arrays, the question is whether Rev's
> fast hash index into array data might help me gather the data from
> specific fields in each record faster than using chunk expressions.
>
> So I took a minute to put together a simple test stack:
> <http://fourthworldlabs.com/rev/speed%20test%20chunks%20vs%
> 20arrays.rev.zip>
>
> It has a field containing a list of contact info, another field for
> displaying the test results, and a third for displaying the
> gathered data from the query so I can verify that it's doing what I
> want it to.
>
> If you download the stack you'll see that in addition to the "Test"
> button there's another one there which converts the list data into
> an array and stores that in a custom property of the field, needed
> for testing the array method.
>
> The code for the "Test" button is below, and I would appreciate
> anyone here who has the time to look it over and see what I may
> have missed, because the results I'm getting are not what I expected.
>
> The test script is typical of many of the tasks I need to perform
> on this data: it checks one field to see if it contains a value,
> checks another to see if it contains a different value, and if both
> are true it collects data from three fields into a tab- and return-
> delimited list so I can drop it into a list field to display the
> output.
>
> I had assumed that using chunk expressions to access items in each
> line would be slower than using array notation to get them through
> the hash in the array. But instead here's the result I'm getting
> (times are in milliseconds for 100 iterations):
>
> GetFromList: 72
> GetFromSubArray: 752
> GetFromMainArray: 407
> All results the same?: true
>
> As noted in the code below, the GetFromList handler uses simple
> chunk expressions to parse the data; GetFromSubArray uses "repeat
> for each element" to parse out the second-tier array within each
> record; GetFromMainArray walks through the keys to get the data
> from the main array by addressing both dimensions; the last line
> simply lets me know that all three are returning the same result.
>
> I can understand why GetFromSubArray is the slowest, since it has
> to instantiate an array for the second-tier array each time through
> the loop (using "repeat for each element...").
>
> But I had hoped that accessing the main array by specifying the
> elements in both dimensions would get to the data more quickly than
> would be needed when asking the engine to count items, but
> apparently not.
>
> Of course there is a scaling issue with chunk expressions. In my
> sample data there are only eight items in each record, but if there
> were several hundred I would imagine it wouldn't perform as well as
> the array methods. But in my case most of the data I work with has
> fewer than 30 fields and since chunk expressions are measuring
> about five times faster I would expect I'd need many more than
> that before chunk expressions drop below arrays in relative
> performance.
>
> The same could be said of the size of the data within each item,
> since that will adversely affect the time the engine needs to walk
> through it looking for item delimiters. But again, it's not often
> that I have field data that's very long (the contact list is a good
> example, in which the longest field data is under 200 chars), and
> the engine's seeking of delimiters seems reasonably efficient.
>
> Another minor drawback to arrays for this sort of thing is what it
> does to the size of the data, esp. if you use meaningful names for
> your fields rather than just numbers: while simple tab-delimited
> data needs only one set of field names and a function to translate
> those into item numbers before dropping into any loop that uses
> them, arrays replicate the field names for each record. The longer
> the field names, the more impact this will have on data size.
>
> I'd be happy to accept this as a trade-off if the speed justified
> it, but in my test the speed benefit just isn't there for this type
> of querying task.
>
> Any thoughts on how I might optimize the array functions? Did I
> just overlook something obvious which would make them run faster?
>
> --
> Richard Gaskin
> Fourth World
> Revolution training and consulting: http://www.fourthworld.com
> Webzine for Rev developers: http://www.revjournal.com
>
>
> ------ code for Test button -------------
>
> on mouseUp
> put empty into fld "Results"
> put empty into fld "Output"
> wait 0 with messages -- so clearing fields is visible
> --
> -- Number of iterations:
> put 100 into n
> --
> -- Test simple chunk expressions:
> put the millisecs into t
> repeat n
> put GetFromList() into tResult1
> end repeat
> put the millisecs - t into t1
> --
> -- Test sub-array method:
> put the millisecs into t
> repeat n
> put GetFromSubArray() into tResult2
> end repeat
> put the millisecs - t into t2
> --
> -- Test main array method:
> put the millisecs into t
> repeat n
> put GetFromMainArray() into tResult3
> end repeat
> put the millisecs - t into t3
> --
> -- Display results:
> put tResult2 into fld "Output"
> put "GetFromList: "&t1 &cr\
> &"GetFromSubArray: "&t2 &cr\
> &"GetFromMainArray: "& t3 &cr\
> & "All results the same?: "&\
> ((tResult1 = tResult2) AND (tResult2 = tResult3)) \
> into fld "Results"
> end mouseUp
>
>
> --
> -- GetFromList
> -- Uses simple item and line chunk expressions to access specific data
> -- elements.
> --
> function GetFromList
> put empty into tList
> set the itemdel to tab
> put fld 1 into tData
> repeat for each line tLine in tData
> if item 1 of tLine contains "A" and item 4 of tLine contains
> "4" then
> put item 1 of tLine &tab& item 3 of tLine &tab& item 5 of
> tLine \
> &cr after tList
> end if
> end repeat
> delete last char of tList -- trailing CR
> sort lines of tList
> return tList
> end GetFromList
>
>
> --
> -- GetFromSubArray
> -- Walks through each element in the two-dimensional array tDataA
> -- by putting each into a sub-array tRecordA, then accesses specific
> -- elements of tRecordA to get specific field data.
> --
> function GetFromSubArray
> put empty into tList
> put the uDataA of fld 1 into tDataA
> repeat for each element tRecordA in tDataA
> if tRecordA[1] contains "A" and tRecordA[4] contains "4" then
> put tRecordA[1] &tab& tRecordA[3] &tab& tRecordA[5] &\
> cr after tList
> end if
> end repeat
> delete last char of tList -- trailing CR
> sort lines of tList
> return tList
> end GetFromSubArray
>
>
> --
> -- GetFromMainArray
> -- First obtains a list of keys of the main array, and uses these
> -- to walk through it, accessing specific elements by that main
> -- key and a sub-key.
> --
> function GetFromMainArray
> put empty into tList
> put the uDataA of fld 1 into tDataA
> repeat for each key tKey in tDataA
> if tDataA[tKey][1] contains "A" and tDataA[tKey][4] contains
> "4" then
> put tDataA[tKey][1] &tab& tDataA[tKey][3] &tab&\
> tDataA[tKey][5] &cr after tList
> end if
> end repeat
> delete last char of tList -- trailing CR
> sort lines of tList
> return tList
> end GetFromMainArray
>
> ###
> _______________________________________________
> use-revolution mailing list
> use-revolution at lists.runrev.com
> Please visit this url to subscribe, unsubscribe and manage your
> subscription preferences:
> http://lists.runrev.com/mailman/listinfo/use-revolution
More information about the use-livecode
mailing list