There should be a "unique" option on sort . . .
Peter Haworth
pete at lcsql.com
Mon Jan 6 11:32:20 EST 2014
Hi Richard,
Thanks for the speed tests.
Your handler that uses an array isn't quite the same as mine: you build the
array with a repeat loop that adds 1 to each key, whereas I used combine. It
would be interesting to see whether that makes any significant difference.
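For reference, a split-based handler along these lines is roughly the sort of
thing I mean - just a sketch, since my actual combine/split code isn't in this
thread, and the repeat-loop version is your UniqueListFromArray below:

function uniqueViaSplit pData
   -- sketch only: the "as set" form of split turns each distinct line into a key
   split pData by cr as set
   put the keys of pData into tKeys
   sort lines of tKeys
   return tKeys
end uniqueViaSplit

Either way the unique lines end up as the keys of an array; the question is
just whether split builds them faster than the repeat loop.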
Pete
lcSQL Software
On Jan 6, 2014 7:13 AM, "Richard Gaskin" <ambassador at fourthworld.com> wrote:
> Geoff Canyon wrote:
>
>> I ran a brief speed test of the code below against using an array to
>> de-dupe and then sort the keys, and couldn't find a data set that made the
>> below code slower. It's not always much faster, but for some data sets it
>> is.
>>
>
> One of the challenges with this sort of comparative benchmarking is
> accounting for scale, and doing so in all relevant dimensions.
>
> In this scenario, two scaling factors come to mind:
> - Number of lines in the data set
> - Number of duplicates among those lines
>
> The test script below has variables at the top which can be altered to
> account for each of those, with an additional variable for number of
> iterations (small data sets can be too quick to measure well).
>
> The size of the data set is simply the number of lines it contains (n).
> The percentage of duplicates is likely higher when the strings are
> shorter, so tMaxLineSize handles that.
>
> The results suggest two things:
>
> 1. Larger data sets favor arrays. At 100k lines this was roughly a
> four-fold boost, with no apparent penalty for smaller data sets.
>
> 2. A larger percentage of duplicates favors arrays, but only slightly, far
> less so than the total size of the data set.
>
>
> Here are the results, run for data sets of 100, 1,000, 10,000, and 100,000
> lines, once each for strings with a maximum of 5 chars, and again with a
> maximum of 250 chars:
>
>
> 10 iterations on 100 lines of 5 or fewer chars:
> Array: 2 ms (0.2 ms per iteration)
> Chunks: 3 ms (0.3 ms per iteration)
> Results match - Each list has 91 lines
>
> 10 iterations on 100 lines of 250 or fewer chars:
> Array: 5 ms (0.5 ms per iteration)
> Chunks: 5 ms (0.5 ms per iteration)
> Results match - Each list has 100 lines
>
>
> 10 iterations on 1000 lines of 5 or fewer chars:
> Array: 13 ms (1.3 ms per iteration)
> Chunks: 17 ms (1.7 ms per iteration)
> Results match - Each list has 629 lines
>
> 10 iterations on 1000 lines of 250 or fewer chars:
> Array: 34 ms (3.4 ms per iteration)
> Chunks: 36 ms (3.6 ms per iteration)
> Results match - Each list has 735 lines
>
>
> 10 iterations on 10000 lines of 5 or fewer chars:
> Array: 61 ms (6.1 ms per iteration)
> Chunks: 150 ms (15 ms per iteration)
> Results match - Each list has 993 lines
>
> 10 iterations on 10000 lines of 250 or fewer chars:
> Array: 181 ms (18.1 ms per iteration)
> Chunks: 414 ms (41.4 ms per iteration)
> Results match - Each list has 1493 lines
>
>
> 10 iterations on 100000 lines of 5 or fewer chars:
> Array: 492 ms (49.2 ms per iteration)
> Chunks: 1674 ms (167.4 ms per iteration)
> Results match - Each list has 987 lines
>
> 10 iterations on 100000 lines of 250 or fewer chars:
> Array: 1514 ms (151.4 ms per iteration)
> Chunks: 5502 ms (550.2 ms per iteration)
> Results match - Each list has 1494 lines
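> That works out to about 3.4x (1674/492) for the short strings and 3.6x
> (5502/1514) for the long ones, which is where the "roughly four-fold"
> figure above comes from.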
>
>
> Here's the code. It was thrown together in a few minutes and may well not
> be a complete measure of what we're looking for, so if you see ways to
> improve it, please feel free to revise:
>
> on mouseUp
>    put 100000 into n -- number of lines of data
>    put 10 into x -- number of iterations
>    put 250 into tMaxLineSize -- Max length of each string
>    --
>    -- Make n lines of random data:
>    put empty into tData
>    -- First make one line of data:
>    repeat with i = 1 to 256 -- make random string 256 bytes long
>       -- in the ASCII range 49 through 121
>       put numToChar(random(73)+48) after s
>    end repeat
>    -- Make n lines from that string, each at most tMaxLineSize bytes:
>    put 256-tMaxLineSize into tStartMax
>    repeat n
>       put random(tStartMax) into tStart
>       put tStart + random(tMaxLineSize-1) into tEnd
>       put char tStart to tEnd of s & cr after tData
>    end repeat
>    --
>    -- Test 1: Array
>    put the millisecs into t
>    repeat x
>       put UniqueListFromArray(tData) into r1
>    end repeat
>    put the millisecs - t into t1
>    --
>    -- Test 2: Chunks
>    put the millisecs into t
>    repeat x
>       put UniqueListFromChunks(tData) into r2
>    end repeat
>    put the millisecs - t into t2
>    --
>    -- Display results:
>    if (r1 = r2) then
>       put "Results match - Each list has " & \
>             the number of lines of r1 & " lines" into tSanityCheck
>    else
>       put "RESULTS DON'T MATCH" into tSanityCheck
>    end if
>    put x & " iterations on " & n & " lines of " & tMaxLineSize & \
>          " or fewer chars:" & cr \
>          & "Array: " & t1 & " ms (" & t1/x & " ms per iteration)" & cr \
>          & "Chunks: " & t2 & " ms (" & t2/x & " ms per iteration)" & cr \
>          & tSanityCheck
> end mouseUp
>
>
> function UniqueListFromArray pData
>    repeat for each line tLine in pData
>       add 1 to tA[tLine]
>    end repeat
>    put the keys of tA into tKeys
>    sort lines of tKeys
>    return tKeys
> end UniqueListFromArray
>
>
> function UniqueListFromChunks pData
>    sort lines of pData
>    # put line 1 of pData is false into lastLine -- what does this do?
>    put empty into tLastLine
>    repeat for each line tLine in pData
>       if tLine is tLastLine then next repeat
>       put tLine & cr after tResult
>       put tLine into tLastLine
>    end repeat
>    delete last char of tResult
>    return tResult
> end UniqueListFromChunks
>
>
> --
> Richard Gaskin
> Fourth World
> LiveCode training and consulting: http://www.fourthworld.com
> Webzine for LiveCode developers: http://www.LiveCodeJournal.com
> Follow me on Twitter: http://twitter.com/FourthWorldSys
>
> _______________________________________________
> use-livecode mailing list
> use-livecode at lists.runrev.com
> Please visit this url to subscribe, unsubscribe and manage your
> subscription preferences:
> http://lists.runrev.com/mailman/listinfo/use-livecode
>