There should be a "unique" option on sort . . .

Richard Gaskin ambassador at fourthworld.com
Mon Jan 6 10:13:23 EST 2014


Geoff Canyon wrote:

> I ran a brief speed test of the code below against using an array to
> de-dupe and then sort the keys, and couldn't find a data set that made the
> below code slower. It's not always much faster, but for some data sets it
> is.

One of the challenges with this sort of comparative benchmarking is 
accounting for scale, and doing so in all relevant dimensions.

In this scenario, two scaling factors come to mind:
- Number of lines in the data set
- Number of duplicates among those lines

The test script below has variables at the top which can be altered to 
account for each of those, with an additional variable for number of 
iterations (small data sets can be too quick to measure well).

The size of the data set is simply the number of lines it contains (n). 
The percentage of duplicates is likely higher when the strings are 
shorter, so tMaxLineSize handles that. Because every line is a slice of 
a single 256-char base string, the generator can produce at most 
(256 - tMaxLineSize) * (tMaxLineSize - 1) distinct lines, which is why 
the unique-line counts in the results below plateau as n grows: near 
1,000 for 5-char lines (251 * 4), and at 1,494 for 250-char lines 
(6 * 249).

The results suggest two things:

1. Larger data sets favor arrays.  At 100k lines the array approach was 
roughly 3.5 times faster, with no apparent penalty for smaller data sets.

2. A larger percentage of duplicates also favors arrays, but only 
slightly, and far less than the total size of the data set does.


Here are the results, run for data sets of 100, 1,000, 10,000, and 
100,000 lines, once each for strings with a maximum of 5 chars, and 
again with a maximum of 250 chars:


10 iterations on 100 lines of 5 or fewer chars:
Array: 2 ms (0.2 ms per iteration)
Chunks: 3 ms (0.3 ms per iteration)
Results match - Each list has 91 lines

10 iterations on 100 lines of 250 or fewer chars:
Array: 5 ms (0.5 ms per iteration)
Chunks: 5 ms (0.5 ms per iteration)
Results match - Each list has 100 lines


10 iterations on 1000 lines of 5 or fewer chars:
Array: 13 ms (1.3 ms per iteration)
Chunks: 17 ms (1.7 ms per iteration)
Results match - Each list has 629 lines

10 iterations on 1000 lines of 250 or fewer chars:
Array: 34 ms (3.4 ms per iteration)
Chunks: 36 ms (3.6 ms per iteration)
Results match - Each list has 735 lines


10 iterations on 10000 lines of 5 or fewer chars:
Array: 61 ms (6.1 ms per iteration)
Chunks: 150 ms (15 ms per iteration)
Results match - Each list has 993 lines

10 iterations on 10000 lines of 250 or fewer chars:
Array: 181 ms (18.1 ms per iteration)
Chunks: 414 ms (41.4 ms per iteration)
Results match - Each list has 1493 lines


10 iterations on 100000 lines of 5 or fewer chars:
Array: 492 ms (49.2 ms per iteration)
Chunks: 1674 ms (167.4 ms per iteration)
Results match - Each list has 987 lines

10 iterations on 100000 lines of 250 or fewer chars:
Array: 1514 ms (151.4 ms per iteration)
Chunks: 5502 ms (550.2 ms per iteration)
Results match - Each list has 1494 lines


Here's the code.  Thrown together in a few minutes, it may well not be a 
complete measure of what we're looking for, so if you see ways to 
improve it please feel free to revise:

on mouseUp
    put 100000 into n -- number of lines of data
    put 10 into x -- number of iterations
    put 250 into tMaxLineSize -- Max length of each string
    --
    -- Make n lines of random data:
    put empty into tData
    -- First make one line of data:
    repeat with i = 1 to 256 -- make a random base string 256 chars long
       -- in the ASCII range 49 through 121 (random(73) returns 1 to 73)
       put numToChar(random(73)+48) after s
    end repeat
    -- Make n lines from that base string, each 2 to tMaxLineSize chars:
    put 256-tMaxLineSize into tStartMax
    repeat n
       put random(tStartMax) into tStart
       put tStart + random(tMaxLineSize-1) into tEnd
       put char tStart to tEnd of s &cr after tData
    end repeat
    --
    -- Test 1: Array
    put the millisecs into t
    repeat x
       put UniqueListFromArray(tData) into r1
    end repeat
    put the millisecs - t into t1
    --
    -- Test 2: Chunks
    put the millisecs into t
    repeat x
       put UniqueListFromChunks(tData) into r2
    end repeat
    put the millisecs - t into t2
    --
    -- Display results:
    if (r1=r2) then
       put "Results match - Each list has "&\
       the number of lines of r1 & " lines" into tSanityCheck
    else
       put "RESULTS DON'T MATCH" into tSanityCheck
    end if
    put x & " iterations on "& n &" lines of "& tMaxLineSize &\
    " or fewer chars:"&cr\
    &"Array: "& t1 &" ms (" & t1/x &" ms per iteration)"&cr\
    &"Chunks: "& t2 &" ms (" & t2/x &" ms per iteration)"&cr\
    & tSanityCheck
end mouseUp


function UniqueListFromArray pData
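    -- tally each line as an array key; duplicate lines collapse
    -- onto the same key, so the keys are the unique lines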
    repeat for each line tLine in pData
       add 1 to tA[tLine]
    end repeat
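    -- array keys come back in no particular order, so sort them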
    put the keys of tA into tKeys
    sort lines of tKeys
    return tKeys
end UniqueListFromArray


function UniqueListFromChunks pData
    sort lines of pData
    -- Geoff's original seeded the comparison variable with a boolean
    -- ("put line 1 of pData is false into lastLine"), which can never
    -- equal line 1 itself; empty suffices here because the generated
    -- data contains no empty lines:
    put empty into tLastLine
    repeat for each line tLine in pData
       if tLine is tLastLine then next repeat
       put tLine & cr after tResult
       put tLine into tLastLine
    end repeat
    delete last char of tResult
    return tResult
end UniqueListFromChunks
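

For anyone who wants to kick the tires on just the two functions, 
here's a minimal sanity check (the sample data is made up for 
illustration) that can go in a second button alongside the handlers 
above:

on mouseUp
    put "pear" & cr & "apple" & cr & "pear" & cr & "apple" into tList
    put UniqueListFromArray(tList) into tFromArray -- "apple" & cr & "pear"
    put UniqueListFromChunks(tList) into tFromChunks -- same result
    put (tFromArray = tFromChunks) -- displays "true" in the message box
end mouseUp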


--
  Richard Gaskin
  Fourth World
  LiveCode training and consulting: http://www.fourthworld.com
  Webzine for LiveCode developers: http://www.LiveCodeJournal.com
  Follow me on Twitter:  http://twitter.com/FourthWorldSys



