Finding duplicates in a list

Eric Chatonet eric.chatonet at sosmartsoftware.com
Wed Jan 9 06:27:50 EST 2008


Hi Ian,

I just tried a simple repeat for each:

function Dups pList
   local tList2,tList3,tTimer,tStart
   -----
   ShowProgress 0,the number of lines of pList --
   put the milliseconds into tStart
   put 0 into tTimer
   repeat for each line tLine in pList
     if tTimer mod 100 = 0 then ShowProgress tTimer --
     add 1 to tTimer
     if tLine is not in tList2 then put tLine & cr after tList2
     else put tLine & cr after tList3
   end repeat
   ShowProgress 0 --
   return the milliseconds - tStart && "ms" & cr & the number of lines of pList & cr & \
         the number of lines of tList3 & cr & tList3
end Dups
-------------------------------
on ShowProgress pPos,pEnd
   set the thumbpos of sb "Progress" to pPos
   if pEnd <> empty then set the endvalue of sb "Progress" to pEnd
end ShowProgress

This ran in about 5 seconds on my Vista machine using your list and
returned 686 duplicates among 8708 references.
The problem with this method is that it slows down as the check
progresses, because tList2 keeps growing :-(
I tried to imagine another solution using arrays.
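
A minimal sketch of what an array-based version might look like, using each line as an array key so that the "seen before?" test no longer scans a growing list. The handler name DupsWithArray and the variables tSeen and tDups are hypothetical, skipping empty lines is an assumption, and the handler has not been run against the list above, so no timing is claimed:

```livecode
function DupsWithArray pList
   local tSeen, tDups
   repeat for each line tLine in pList
      -- assumption: empty lines are not checksums, so skip them
      if tLine is empty then next repeat
      if tSeen[tLine] is empty then
         -- first occurrence: remember it as an array key
         put true into tSeen[tLine]
      else
         -- key already present: this line is a duplicate
         put tLine & cr after tDups
      end if
   end repeat
   return tDups
end DupsWithArray
```

Called as put DupsWithArray(tList), this should return every repeated occurrence after the first, one per line. Array keys are compared case-insensitively unless the caseSensitive property is set to true, which should not matter for hexadecimal checksums.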

Best regards from Paris,
Eric Chatonet.

On 9 Jan 08, at 06:44, Ian Wood wrote:

> The problem - trying to find duplicate files in a database (Apple  
> Aperture), and have found a checksum column for all the image files.
>
> I've had a go at writing a handler to find the dupes and it does  
> OK, but wondered if the bright sparks on the list have any advice  
> on speeding it up...
>
> The handler:
>
> ====================
>
> put the milliseconds into tt
>   put ijwAPLIB_getAllChecksums() into tList  -- this returns the  
> list of checksums, 10k in my sample DB, over 40k in the 'real' DB
>   put number of lines of tList into tNumLines
>   sort tlist
>   put 0 into x
>   repeat tNumLines times
>     add 1 to x
>     if last char of x is 1 then set the cursor to busy  -- removing  
> this speeds it up by roughly 10%
>     put line x of tList into tCheck
>     if tCheck is empty then next repeat
>     put x + 1 into y
>     repeat (tNumLines - x) times
>       put line y of tList into tOther
>       if tCheck is tOther then
>         put x & tab & y & tab & tCheck & return after tRet
>       else
>         put y - 1 into x  -- the outer loop adds 1, so line y is checked next
>         exit repeat
>       end if
>       add 1 to y
>     end repeat
>   end repeat
>   put the milliseconds - tt & return & "number of files:" &&  
> tNumLines & return & return & tRet
>
> ====================
>
> Sample results:
>
> 9804
> number of files: 8708
>
> 116	117	027351c1bed597af774536af8e982363
> 119	120	0292d175c04d790f50246a5ee043a599
> 162	163	03d6313ee21a91ed0b0343f339c583e4
> 185	186	046ddab379a8f44955f1d5605c294605
> 230	231	05a77db5e76eb02f8d439e13286d3620
> 245	246	065474aa9bba7e2f24c7435863f5f2ff
> 314	315	0884f4b24b5bd99ddefdb100fde58a31
> 333	334	0918ce2135933d6c8f0ee2860837b5f9
> 360	361	0a2525bef1a46a329b7e902981ef94e2
> 360	362	0a2525bef1a46a329b7e902981ef94e2
> 360	363	0a2525bef1a46a329b7e902981ef94e2
> 360	364	0a2525bef1a46a329b7e902981ef94e2
>
> Ian

----------------------------------------------------------------
Plugins and tutorials for Revolution: http://www.sosmartsoftware.com/
Email: eric.chatonet at sosmartsoftware.com
----------------------------------------------------------------




