Finding duplicates in a list

Eric Chatonet eric.chatonet at sosmartsoftware.com
Wed Jan 9 07:19:00 EST 2008


Hi Ian,

Sorry, I did not notice that my email went out unfinished:
the second part, about a solution using arrays, is missing. It does not
matter, though, because you already got Bill's answer.
Mine was almost the same :-)
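
For reference, a minimal sketch of what the array idea could look like. This is an illustration only: it is not the unfinished code from my earlier email, it is not necessarily identical to Bill's answer, and the handler name DupsWithArray is hypothetical.

```livecode
function DupsWithArray pList
  local tSeen, tDups
  repeat for each line tLine in pList
    if tLine is empty then next repeat
    if tSeen[tLine] is empty then
      -- first time this checksum is met: remember it as an array key
      put true into tSeen[tLine]
    else
      -- the key already exists, so this line is a duplicate
      put tLine & cr after tDups
    end if
  end repeat
  return tDups
end DupsWithArray
```

Called as `put DupsWithArray(tList)`, it needs no prior sort. Each check is a keyed lookup rather than a search through a growing list, so the time per line should stay roughly constant instead of increasing as the loop progresses.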

Best regards from Paris,
Eric Chatonet.

On 9 Jan 2008, at 12:27, Eric Chatonet wrote:

> Hi Ian,
>
> I just tried a simple repeat for each:
>
> function Dups pList
>   local tList2,tList3,tTimer,tStart
>   -----
>   ShowProgress 0,the number of lines of pList --
>   put the milliseconds into tStart
>   put 0 into tTimer
>   repeat for each line tLine in pList
>     if tTimer mod 100 = 0 then ShowProgress tTimer --
>     add 1 to tTimer
>     if tLine is not in tList2 then put tLine & cr after tList2
>     else put tLine & cr after tList3
>   end repeat
>   ShowProgress 0 --
>   return the milliseconds - tStart && "ms" & cr & \
>       the number of lines of pList & cr & \
>       the number of lines of tList3 & cr & tList3
> end Dups
> -------------------------------
> on ShowProgress pPos,pEnd
>   set the thumbpos of sb "Progress" to pPos
>   if pEnd <> empty then set the endvalue of sb "Progress" to pEnd
> end ShowProgress
>
> This ran in about 5 seconds on my Vista machine using your list and  
> returned 686 duplicates among 8708 references.
> The problem with such a method is that it slows down as the
> check progresses, because tList2 keeps growing and every test has
> to search through it :-(
> I tried to imagine another solution using arrays
>
> Best regards from Paris,
> Eric Chatonet.
>
> On 9 Jan 2008, at 06:44, Ian Wood wrote:
>
>> The problem - trying to find duplicate files in a database (Apple  
>> Aperture), and have found a checksum column for all the image files.
>>
>> I've had a go at writing a handler to find the dupes and it does
>> OK, but wondered if the bright sparks on the list have any advice
>> on speeding it up...
>>
>> The handler:
>>
>> ====================
>>
>> put the milliseconds into tt
>>   -- this returns the list of checksums: 10k in my sample DB, over 40k in the 'real' DB
>>   put ijwAPLIB_getAllChecksums() into tList
>>   put number of lines of tList into tNumLines
>>   sort tlist
>>   put 0 into x
>>   repeat tNumLines times
>>     add 1 to x
>>     -- removing this cursor update speeds it up by roughly 10%
>>     if last char of x is 1 then set the cursor to busy
>>     put line x of tList into tCheck
>>     if tCheck is empty then next repeat
>>     put x + 1 into y
>>     repeat (tNumLines - x) times
>>       put line y of tList into tOther
>>       if tCheck is tOther then
>>         put x & tab & y & tab & tCheck & return after tRet
>>       else
>>         put y - 1 into x  -- the outer loop adds 1, so the next check starts at line y
>>         exit repeat
>>       end if
>>       add 1 to y
>>     end repeat
>>   end repeat
>>   put the milliseconds - tt & return & "number of files:" && \
>>       tNumLines & return & return & tRet
>>
>> ====================
>>
>> Sample results:
>>
>> 9804
>> number of files: 8708
>>
>> 116	117	027351c1bed597af774536af8e982363
>> 119	120	0292d175c04d790f50246a5ee043a599
>> 162	163	03d6313ee21a91ed0b0343f339c583e4
>> 185	186	046ddab379a8f44955f1d5605c294605
>> 230	231	05a77db5e76eb02f8d439e13286d3620
>> 245	246	065474aa9bba7e2f24c7435863f5f2ff
>> 314	315	0884f4b24b5bd99ddefdb100fde58a31
>> 333	334	0918ce2135933d6c8f0ee2860837b5f9
>> 360	361	0a2525bef1a46a329b7e902981ef94e2
>> 360	362	0a2525bef1a46a329b7e902981ef94e2
>> 360	363	0a2525bef1a46a329b7e902981ef94e2
>> 360	364	0a2525bef1a46a329b7e902981ef94e2
>>
>> Ian
>

----------------------------------------------------------------
Plugins and tutorials for Revolution: http://www.sosmartsoftware.com/
Email: eric.chatonet at sosmartsoftware.com
----------------------------------------------------------------




