There should be a "unique" option on sort . . .

Peter Haworth pete at lcsql.com
Mon Jan 6 11:32:20 EST 2014


Hi Richard,
Thanks for the speed tests.

Your handler to use an array isn't quite the same as mine. You have a
repeat loop adding 1 to the key to create the array whereas I used combine.
Be interesting to see if that makes any significant difference.

Pete
lcSQL Software
On Jan 6, 2014 7:13 AM, "Richard Gaskin" <ambassador at fourthworld.com> wrote:

> Geoff Canyon wrote:
>
>  I ran a brief speed test of the code below against using an array to
>> de-dupe and then sort the keys, and couldn't find a data set that made the
>> below code slower. It's not always much faster, but for some data sets it
>> is.
>>
>
> One of the challenges with this sort of comparative benchmarking is
> accounting for scale, and doing so in all relevant dimensions.
>
> In this scenario, two scaling factors come to mind:
> - Number of lines in the data set
> - Number of duplicates among those lines
>
> The test script below has variables at the top which can be altered to
> account for each of those, with an additional variable for number of
> iterations (small data sets can be too quick to measure well).
>
> The size of the data set is simply the number of lines it contains (n).
>  The percentage of duplicates is likely higher when the strings are
> shorter, so tMaxLineSize handles that.
>
> The results suggest two things:
>
> 1. Larger data sets favor arrays.  At 100k lines this was roughly a
> four-fold boost, with no apparent penalty for smaller data sets.
>
> 2. A larger percentage of duplicates favors arrays, but only slightly, far
> less so than the total size of the data set.
>
>
> Here are the results, run for data sets of 100, 1,000, 10,000, and 100,000
> lines, once each for strings with a maximum of 5 chars, and again with a
> maximum of 250 chars:
>
>
> 10 iterations on 100 lines of 5 or fewer chars:
> Array: 2 ms (0.2 ms per iteration)
> Chunks: 3 ms (0.3 ms per iteration)
> Results match - Each list has 91 lines
>
> 10 iterations on 100 lines of 250 or fewer chars:
> Array: 5 ms (0.5 ms per iteration)
> Chunks: 5 ms (0.5 ms per iteration)
> Results match - Each list has 100 lines
>
>
> 10 iterations on 1000 lines of 5 or fewer chars:
> Array: 13 ms (1.3 ms per iteration)
> Chunks: 17 ms (1.7 ms per iteration)
> Results match - Each list has 629 lines
>
> 10 iterations on 1000 lines of 250 or fewer chars:
> Array: 34 ms (3.4 ms per iteration)
> Chunks: 36 ms (3.6 ms per iteration)
> Results match - Each list has 735 lines
>
>
> 10 iterations on 10000 lines of 5 or fewer chars:
> Array: 61 ms (6.1 ms per iteration)
> Chunks: 150 ms (15 ms per iteration)
> Results match - Each list has 993 lines
>
> 10 iterations on 10000 lines of 250 or fewer chars:
> Array: 181 ms (18.1 ms per iteration)
> Chunks: 414 ms (41.4 ms per iteration)
> Results match - Each list has 1493 lines
>
>
> 10 iterations on 100000 lines of 5 or fewer chars:
> Array: 492 ms (49.2 ms per iteration)
> Chunks: 1674 ms (167.4 ms per iteration)
> Results match - Each list has 987 lines
>
> 10 iterations on 100000 lines of 250 or fewer chars:
> Array: 1514 ms (151.4 ms per iteration)
> Chunks: 5502 ms (550.2 ms per iteration)
> Results match - Each list has 1494 lines
>
>
> Here's the code - thrown together in a few minutes, it may well not be a
> complete measure of what we're looking for, so if you see ways to improve
> it please feel free to revise:
>
> on mouseUp
>    put 100000 into n -- number of lines of data
>    put 10 into x -- number of iterations
>    put 250 into tMaxLineSize -- Max length of each string
>    --
>    -- Make n lines of random data:
>    put empty into tData
>    -- First make one line of data:
>    repeat with i = 1 to 256 -- make random string 256 bytes long
>       -- in the ASCI range 48 through 122
>       put numToChar(random(73)+48) after s
>    end repeat
>    -- Make n lines from that data, each at least Z bytes:
>    put 256-tMaxLineSize into tStartMax
>    repeat n
>       put random(tStartMax) into tStart
>       put tStart + random(tMaxLineSize-1) into tEnd
>       put char tStart to tEnd of s &cr after tData
>    end repeat
>    --
>    -- Test 1: Array
>    put the millisecs into t
>    repeat x
>       put UniqueListFromArray(tData) into r1
>    end repeat
>    put the millisecs - t into t1
>    --
>    -- Test 2: Chunks
>    put the millisecs into t
>    repeat x
>       put UniqueListFromChunks(tData) into r2
>    end repeat
>    put the millisecs - t into t2
>    --
>    -- Display results:
>    if (r1=r2) then
>       put "Results match - Each list has "&\
>       the number of lines of r1 & " lines" into tSanityCheck
>    else
>       put "RESULTS DON'T MATCH" into tSanityCheck
>    end if
>    put x & " iterations on "& n &" lines of "& tMaxLineSize &\
>    " or fewer chars:"&cr\
>    &"Array: "& t1 &" ms (" & t1/x &" ms per iteration)"&cr\
>    &"Chunks: "& t2 &" ms (" & t2/x &" ms per iteration)"&cr\
>    & tSanityCheck
> end mouseUp
>
>
> function UniqueListFromArray pData
>    repeat for each line tLine in pData
>       add 1 to tA[tLine]
>    end repeat
>    put the keys of tA into tKeys
>    sort lines of tKeys
>    return tKeys
> end UniqueListFromArray
>
>
> function UniqueListFromChunks pData
>    sort lines of pData
>    # put line 1 of pData is false into lastLine -- what does this do?
>    put empty into lastLine
>    repeat for each line tLine in pData
>       if tLine is tLastLine then next repeat
>       put tLine & cr after tResult
>       put tLine into tLastLine
>    end repeat
>    delete last char of tResult
>    return tResult
> end UniqueListFromChunks
>
>
> --
>  Richard Gaskin
>  Fourth World
>  LiveCode training and consulting: http://www.fourthworld.com
>  Webzine for LiveCode developers: http://www.LiveCodeJournal.com
>  Follow me on Twitter:  http://twitter.com/FourthWorldSys
>
> _______________________________________________
> use-livecode mailing list
> use-livecode at lists.runrev.com
> Please visit this url to subscribe, unsubscribe and manage your
> subscription preferences:
> http://lists.runrev.com/mailman/listinfo/use-livecode
>



More information about the use-livecode mailing list