Finding duplicates in a list

Ian Wood revlist at azurevision.co.uk
Wed Jan 9 00:44:41 EST 2008


The problem: I'm trying to find duplicate files in a database (Apple Aperture), and have found a checksum column for all the image files.

I've had a go at writing a handler to find the dupes and it does OK, but I wondered if the bright sparks on the list have any advice on speeding it up...

The handler:

====================

   put the milliseconds into tt
   put ijwAPLIB_getAllChecksums() into tList  -- this returns the list of checksums, 10k in my sample DB, over 40k in the 'real' DB
   put number of lines of tList into tNumLines
   sort tList
   put 0 into x
   repeat tNumLines times
     add 1 to x
     if last char of x is 1 then set the cursor to busy  -- removing this speeds it up by roughly 10%
     put line x of tList into tCheck
     if tCheck is empty then next repeat
     put x + 1 into y
     repeat (tNumLines - x) times
       put line y of tList into tOther
       if tCheck is tOther then
         put x & tab & y & tab & tCheck & return after tRet
       else
          put y - 1 into x  -- step back so the outer loop's 'add 1 to x' lands on the first non-matching line
         exit repeat
       end if
       add 1 to y
     end repeat
   end repeat
   put the milliseconds - tt & return & "number of files:" && tNumLines & return & return & tRet

====================

Sample results:

9804
number of files: 8708

116	117	027351c1bed597af774536af8e982363
119	120	0292d175c04d790f50246a5ee043a599
162	163	03d6313ee21a91ed0b0343f339c583e4
185	186	046ddab379a8f44955f1d5605c294605
230	231	05a77db5e76eb02f8d439e13286d3620
245	246	065474aa9bba7e2f24c7435863f5f2ff
314	315	0884f4b24b5bd99ddefdb100fde58a31
333	334	0918ce2135933d6c8f0ee2860837b5f9
360	361	0a2525bef1a46a329b7e902981ef94e2
360	362	0a2525bef1a46a329b7e902981ef94e2
360	363	0a2525bef1a46a329b7e902981ef94e2
360	364	0a2525bef1a46a329b7e902981ef94e2
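
For comparison, one way to avoid both the sort and the repeated "line x of tList" lookups might be to key an array on the checksum, so each line is only read once. A rough, untested sketch along the same lines (tSeen, tLineNum and tRet are just placeholder names):

====================

   put the milliseconds into tt
   put ijwAPLIB_getAllChecksums() into tList
   put 0 into tLineNum
   repeat for each line tCheck in tList
     add 1 to tLineNum
     if tCheck is empty then next repeat
     if tSeen[tCheck] is empty then
       -- first time this checksum turns up: remember its line number
       put tLineNum into tSeen[tCheck]
     else
       -- seen before: report the first line, this line and the checksum
       put tSeen[tCheck] & tab & tLineNum & tab & tCheck & return after tRet
     end if
   end repeat
   put the milliseconds - tt & return & "number of files:" && tLineNum & return & return & tRet

====================

The dupes come out in file order rather than sorted by checksum, but something like "sort tRet by word 3 of each" afterwards should put that back if it matters.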

Ian


