There should be a "unique" option on sort . . .

Richard Gaskin ambassador at fourthworld.com
Tue Jan 7 09:29:37 EST 2014


Geoff, I think you may have stumbled across the greatest challenge of 
benchmarking:  the subtle differences over which we have little or no 
control.

Many factors can affect performance, including differences in chip 
architectures with regard to instruction set features, caching, etc., 
system memory caching, background processes, and much more.  There have 
even been times when I've seen that changing the order in which testing 
functions are called can affect outcomes.  And file I/O tests often 
benefit from a cold boot due to system caching - sometimes worked around 
by running purge, but still tedious at best.
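
One rough way to check for the call-order effect is to time both functions 
in each order and compare the two passes.  This is only a sketch: the 
handler name TimeBothOrders is hypothetical, and it assumes 
UniqueListFromArray3 and UniqueListFromChunks are both defined in the same 
script.

command TimeBothOrders pData, pIterations
     local tStart, tDiscard, tReport
     # Pass 1: array version first, then chunk version
     put the long seconds into tStart
     repeat pIterations times
          put UniqueListFromArray3(pData) into tDiscard
     end repeat
     put "Array (run first):" && (the long seconds - tStart) & cr after tReport
     put the long seconds into tStart
     repeat pIterations times
          put UniqueListFromChunks(pData) into tDiscard
     end repeat
     put "Chunks (run second):" && (the long seconds - tStart) & cr after tReport
     # Pass 2: same tests in the opposite order
     put the long seconds into tStart
     repeat pIterations times
          put UniqueListFromChunks(pData) into tDiscard
     end repeat
     put "Chunks (run first):" && (the long seconds - tStart) & cr after tReport
     put the long seconds into tStart
     repeat pIterations times
          put UniqueListFromArray3(pData) into tDiscard
     end repeat
     put "Array (run second):" && (the long seconds - tStart) & cr after tReport
     # Show all four timings in the Message Box
     put tReport
end TimeBothOrders

If the ranking flips between the two passes, the difference is more 
likely noise or ordering than anything in the functions themselves.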

When we see stark differences like the ones between your and my results 
running the same code in the same version of LiveCode (I'm using 6.5.1), 
I think we've moved outside of LiveCode and are now at the mercy of 
hardware and software nuances beyond our control.

If nothing else it reminds us of the value of testing comparative 
benchmarks across multiple machines.

But one fairly constant thing in all this is that when we can hand off 
processing to the compiled C++ code in the engine, in many cases (Regex 
and others notwithstanding) we can expect a boost in performance.

I would be interested to see the results if you swap out the original 
array function with this one, from 
<http://lists.runrev.com/pipermail/use-livecode/2014-January/197038.html>:

function UniqueListFromArray3 pData
     set the caseSensitive to true
     split pData using cr and cr
     put the keys of pData into tKeys
     sort lines of tKeys
     return tKeys
end UniqueListFromArray3

Under the hood, the split command presumably does a series of steps very 
similar to what we'd have to do in script to build an array, but without 
the overhead of type coercion and other factors that characterize 
dynamic compilation, and using an algo written very specifically for 
that task.  That is, it still needs to hash every key and move the 
contents into the appropriate bucket, but with far less overhead than 
running those steps through the interpreter.
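
For comparison, a scripted equivalent of those steps might look something 
like the sketch below.  The handler name UniqueListFromScriptedArray is 
hypothetical, and this is not necessarily the array function used earlier 
in the thread; it only illustrates the per-line work that split takes 
over from the interpreter.

function UniqueListFromScriptedArray pData
     local tArray, tKeys
     # Match split's case-sensitive handling of keys
     set the caseSensitive to true
     # Each line becomes a key; duplicate lines land on the same key
     repeat for each line tLine in pData
          put empty into tArray[tLine]
     end repeat
     put the keys of tArray into tKeys
     sort lines of tKeys
     return tKeys
end UniqueListFromScriptedArray

Here every pass through the repeat loop is executed by the interpreter, 
whereas split does its hashing in a single engine call.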

The only downside to using split is that it differs from most other 
array-building methods in that it's always case-sensitive (see the 
enhancement request to allow split to use the caseSensitive property 
here: <http://quality.runrev.com/show_bug.cgi?id=11651>).

So to pass the sanity check, you'll need to add this line to the other 
function as well:

    set the caseSensitive to true

I'll wager the beverage of your choice at RevLive in San Diego this 
September that the split method will be faster even on your system.

--
  Richard Gaskin
  Fourth World
  LiveCode training and consulting: http://www.fourthworld.com
  Webzine for LiveCode developers: http://www.LiveCodeJournal.com
  Follow me on Twitter:  http://twitter.com/FourthWorldSys



Geoff Canyon wrote:

> I *think* I tested up to 100,000 lines -- spent the day traveling back from
> Boston to St. Louis, a little groggy -- *but* my keys were always integers.
>
> On Mon, Jan 6, 2014 at 10:13 AM, Richard Gaskin
> <ambassador at fourthworld.com> wrote:
>
>> function UniqueListFromChunks pData
>>    sort lines of pData
>>    # put line 1 of pData is false into tLastLine -- what does this do?
>>    put empty into tLastLine
>>    repeat for each line tLine in pData
>>       if tLine is tLastLine then next repeat
>>
>
> NOTE: I corrected the variable names in the above.
>
> Ha ;-) on any day other than yesterday I would have done what you did, or
> skipped the initialization altogether. BUT -- if the first line of data in
> pData is empty (your version) or tLastLine (mine), the above will
> (incorrectly) omit it. So the line I gave, "put line 1 of pData is false
> into lastLine" guarantees that the first line will be included, without
> having to use a conditional that only matters the first iteration inside
> the repeat.
>
> Huh, that's weird. I copy/pasted your code into a button and ran it:
>
> 10 iterations on 100 lines of 5 or fewer chars:
>
> Array: 2 ms (0.2 ms per iteration)
>
> Chunks: 1 ms (0.1 ms per iteration)
>
> Results match - Each list has 95 lines
>
>
> I ran that several times, and the winner flip-flopped several times. So I
> switched to the long seconds. With that, the "Chunks" version is almost
> always the winner (if only by a few ten-thousandths of a second). Typical
> result:
>
>
> 10 iterations on 100 lines of 5 or fewer chars:
>
> Array: 0.001466 seconds (0.000147 seconds per iteration)
>
> Chunks: 0.001218 seconds (0.000122 seconds per iteration)
>
> Results match - Each list has 97 lines
>
>
> Curiouser and curiouser:
>
>
> 10 iterations on 100 lines of 250 or fewer chars:
>
> Array: 0.002393 seconds (0.000239 seconds per iteration)
>
> Chunks: 0.001738 seconds (0.000174 seconds per iteration)
>
> Results match - Each list has 97 lines
>
>
>
> With 1000 lines it favors the array more often, but I still saw outcomes
> where chunks won (not this result, obviously -- trying to be
> representative):
>
>
> 10 iterations on 1000 lines of 5 or fewer chars:
>
> Array: 0.007609 seconds (0.000761 seconds per iteration)
>
> Chunks: 0.007894 seconds (0.000789 seconds per iteration)
>
> Results match - Each list has 617 lines
>
>
> And then back to chunks (mostly):
>
>
> 10 iterations on 1000 lines of 250 or fewer chars:
>
> Array: 0.015478 seconds (0.001548 seconds per iteration)
>
> Chunks: 0.015227 seconds (0.001523 seconds per iteration)
>
> Results match - Each list has 740 lines
>
>
> We start converging at 10,000 and beyond:
>
>
> 10 iterations on 10000 lines of 5 or fewer chars:
>
> Array: 0.029378 seconds (0.002938 seconds per iteration)
>
> Chunks: 0.06806 seconds (0.006806 seconds per iteration)
>
> Results match - Each list has 988 lines
>
>
> 10 iterations on 10000 lines of 250 or fewer chars:
>
> Array: 0.071169 seconds (0.007117 seconds per iteration)
>
> Chunks: 0.148104 seconds (0.01481 seconds per iteration)
>
> Results match - Each list has 1492 lines
>
>
> 10 iterations on 100000 lines of 5 or fewer chars:
>
> Array: 0.229239 seconds (0.022924 seconds per iteration)
>
> Chunks: 0.732289 seconds (0.073229 seconds per iteration)
>
> Results match - Each list has 985 lines
>
>
> 10 iterations on 100000 lines of 250 or fewer chars:
>
> Array: 0.604814 seconds (0.060481 seconds per iteration)
>
> Chunks: 2.04249 seconds (0.204249 seconds per iteration)
>
> Results match - Each list has 1494 lines
