There should be a "unique" option on sort . . .
Richard Gaskin
ambassador at fourthworld.com
Mon Jan 6 10:13:23 EST 2014
Geoff Canyon wrote:
> I ran a brief speed test of the code below against using an array to
> de-dupe and then sort the keys, and couldn't find a data set that made the
> below code slower. It's not always much faster, but for some data sets it
> is.
One of the challenges with this sort of comparative benchmarking is
accounting for scale, and doing so in all relevant dimensions.
In this scenario, two scaling factors come to mind:
- Number of lines in the data set
- Number of duplicates among those lines
The test script below has variables at the top which can be altered to
account for each of those, with an additional variable for number of
iterations (small data sets can be too quick to measure well).
The size of the data set is simply the number of lines it contains (n).
The percentage of duplicates tends to be higher when the strings are
shorter, so tMaxLineSize caps the length of each line to control that.
The results suggest two things:
1. Larger data sets favor arrays. At 100k lines this was roughly a
three- to four-fold boost, with no apparent penalty for smaller data sets.
2. A larger percentage of duplicates favors arrays, but only slightly,
far less so than the total size of the data set.
Here are the results, run for data sets of 100, 1,000, 10,000, and
100,000 lines, once each for strings with a maximum of 5 chars, and
again with a maximum of 250 chars:
10 iterations on 100 lines of 5 or fewer chars:
   Array:  2 ms (0.2 ms per iteration)
   Chunks: 3 ms (0.3 ms per iteration)
   Results match - Each list has 91 lines

10 iterations on 100 lines of 250 or fewer chars:
   Array:  5 ms (0.5 ms per iteration)
   Chunks: 5 ms (0.5 ms per iteration)
   Results match - Each list has 100 lines

10 iterations on 1000 lines of 5 or fewer chars:
   Array:  13 ms (1.3 ms per iteration)
   Chunks: 17 ms (1.7 ms per iteration)
   Results match - Each list has 629 lines

10 iterations on 1000 lines of 250 or fewer chars:
   Array:  34 ms (3.4 ms per iteration)
   Chunks: 36 ms (3.6 ms per iteration)
   Results match - Each list has 735 lines

10 iterations on 10000 lines of 5 or fewer chars:
   Array:  61 ms (6.1 ms per iteration)
   Chunks: 150 ms (15 ms per iteration)
   Results match - Each list has 993 lines

10 iterations on 10000 lines of 250 or fewer chars:
   Array:  181 ms (18.1 ms per iteration)
   Chunks: 414 ms (41.4 ms per iteration)
   Results match - Each list has 1493 lines

10 iterations on 100000 lines of 5 or fewer chars:
   Array:  492 ms (49.2 ms per iteration)
   Chunks: 1674 ms (167.4 ms per iteration)
   Results match - Each list has 987 lines

10 iterations on 100000 lines of 250 or fewer chars:
   Array:  1514 ms (151.4 ms per iteration)
   Chunks: 5502 ms (550.2 ms per iteration)
   Results match - Each list has 1494 lines
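A side note on the data itself (an observation about the generator, not
something the benchmark measures): every line is a substring of one fixed
256-char string, so the generator can only emit a bounded number of
distinct lines, which is why the unique counts plateau near 1,000 and
1,500 at the larger sizes. A quick back-of-envelope check, using the same
variable names as the script below:

on mouseUp
   -- Each line is char tStart to tEnd of one fixed 256-char string,
   -- with tStart in 1..tStartMax and tMaxLineSize - 1 possible lengths,
   -- so at most tStartMax * (tMaxLineSize - 1) distinct lines can appear:
   put (256 - 5) * (5 - 1) into tShortBound -- 1004, vs ~990 observed
   put (256 - 250) * (250 - 1) into tLongBound -- 1494, vs 1493-1494 observed
   answer tShortBound && tLongBound
end mouseUp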
Here's the code - thrown together in a few minutes, it may well not be a
complete measure of what we're looking for, so if you see ways to
improve it please feel free to revise:
on mouseUp
   put 100000 into n -- number of lines of data
   put 10 into x -- number of iterations
   put 250 into tMaxLineSize -- max length of each string
   --
   -- Make n lines of random data:
   put empty into tData
   -- First make one random string 256 chars long,
   -- in the ASCII range 49 through 121 (random(73)+48):
   repeat with i = 1 to 256
      put numToChar(random(73)+48) after s
   end repeat
   -- Make n lines from that string, each a random substring
   -- between 2 and tMaxLineSize chars long:
   put 256-tMaxLineSize into tStartMax
   repeat n
      put random(tStartMax) into tStart
      put tStart + random(tMaxLineSize-1) into tEnd
      put char tStart to tEnd of s & cr after tData
   end repeat
   --
   -- Test 1: Array
   put the millisecs into t
   repeat x
      put UniqueListFromArray(tData) into r1
   end repeat
   put the millisecs - t into t1
   --
   -- Test 2: Chunks
   put the millisecs into t
   repeat x
      put UniqueListFromChunks(tData) into r2
   end repeat
   put the millisecs - t into t2
   --
   -- Display results:
   if r1 = r2 then
      put "Results match - Each list has " & \
            the number of lines of r1 & " lines" into tSanityCheck
   else
      put "RESULTS DON'T MATCH" into tSanityCheck
   end if
   put x & " iterations on " & n & " lines of " & tMaxLineSize & \
         " or fewer chars:" & cr & \
         "Array: " & t1 & " ms (" & t1/x & " ms per iteration)" & cr & \
         "Chunks: " & t2 & " ms (" & t2/x & " ms per iteration)" & cr & \
         tSanityCheck
end mouseUp
function UniqueListFromArray pData
   -- array keys are unique, so duplicates collapse as they're added;
   -- only the unique keys then need to be sorted:
   repeat for each line tLine in pData
      add 1 to tA[tLine]
   end repeat
   put the keys of tA into tKeys
   sort lines of tKeys
   return tKeys
end UniqueListFromArray
function UniqueListFromChunks pData
   sort lines of pData
   -- Geoff's original seeded this with "put line 1 of pData is false
   -- into lastLine", apparently to guarantee the seed differs from
   -- line 1; empty works here since the test data has no empty lines:
   put empty into tLastLine
   repeat for each line tLine in pData
      if tLine is tLastLine then next repeat
      put tLine & cr after tResult
      put tLine into tLastLine
   end repeat
   delete last char of tResult -- trailing cr
   return tResult
end UniqueListFromChunks
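Not part of the test above, but since the thread is about wishing for a
built-in "unique" option: if what's wanted is de-duping while preserving
the original order (unique without the sort), the array technique adapts
easily, since the seen-before check is just a key lookup. A quick sketch
(UniqueListKeepOrder is a hypothetical name, not from the benchmark):

function UniqueListKeepOrder pData
   -- keep the first occurrence of each line, in original order
   repeat for each line tLine in pData
      if tLine is among the keys of tSeenA then next repeat
      put 1 into tSeenA[tLine]
      put tLine & cr after tResult
   end repeat
   delete last char of tResult -- trailing cr
   return tResult
end UniqueListKeepOrder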
--
Richard Gaskin
Fourth World
LiveCode training and consulting: http://www.fourthworld.com
Webzine for LiveCode developers: http://www.LiveCodeJournal.com
Follow me on Twitter: http://twitter.com/FourthWorldSys