Finding duplicates in a list
Eric Chatonet
eric.chatonet at sosmartsoftware.com
Wed Jan 9 07:19:00 EST 2008
Hi Ian,
Sorry, I did not notice that my email went out unfinished:
the second part, about a solution using arrays, is missing. It does not
matter, though, since you already have Bill's answer.
Mine was almost the same :-)
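For reference, one way an array-based check could look is sketched below. It is an illustration only, not the unfinished second part and not Bill's answer; the handler name DupsByArray and its variable names are made up for the example. Each line is used as an array key, so testing whether a line was seen before no longer means scanning a growing list.

```livecode
function DupsByArray pList
   local tSeen, tDups
   repeat for each line tLine in pList
      if tLine is empty then next repeat
      -- tSeen acts as a set: an empty element means "not seen yet"
      if tSeen[tLine] is empty then
         put true into tSeen[tLine]
      else
         -- second or later occurrence: record it as a duplicate
         put tLine & cr after tDups
      end if
   end repeat
   return tDups
end DupsByArray
```

It could be called as: put DupsByArray(tList) into tDups. Array keys are compared without regard to case unless the caseSensitive property is set to true, which should not matter for lowercase hex checksums.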
Best regards from Paris,
Eric Chatonet.
On 9 Jan 2008, at 12:27, Eric Chatonet wrote:
> Hi Ian,
>
> I just tried a simple repeat for each:
>
> function Dups pList
>    local tList2,tList3,tTimer,tStart
>    -----
>    ShowProgress 0,the number of lines of pList --
>    put the milliseconds into tStart
>    put 0 into tTimer
>    repeat for each line tLine in pList
>       if tTimer mod 100 = 0 then ShowProgress tTimer --
>       add 1 to tTimer
>       if tLine is not in tList2 then put tLine & cr after tList2
>       else put tLine & cr after tList3
>    end repeat
>    ShowProgress 0 --
>    return the milliseconds - tStart && "ms" & cr & the number of \
>          lines of pList & cr & the number of lines of tList3 & cr & tList3
> end Dups
> -------------------------------
> on ShowProgress pPos,pEnd
>    set the thumbpos of sb "Progress" to pPos
>    if pEnd <> empty then set the endvalue of sb "Progress" to pEnd
> end ShowProgress
>
> This ran in about 5 seconds on my Vista machine using your list and
> returned 686 duplicates among 8708 references.
> The problem with such a method is that it slows down as the
> check progresses, because tList2 keeps growing :-(
> I tried to imagine another solution using arrays
>
> Best regards from Paris,
> Eric Chatonet.
>
> On 9 Jan 2008, at 06:44, Ian Wood wrote:
>
>> The problem - trying to find duplicate files in a database (Apple
>> Aperture), and have found a checksum column for all the image files.
>>
>> I've had a go at writing a handler to find the dupes and it does
>> OK, but wondered if the bright sparks on the list have any advice
>> on speeding it up...
>>
>> The handler:
>>
>> ====================
>>
>> put the milliseconds into tt
>> -- this returns the list of checksums, 10k in my sample DB,
>> -- over 40k in the 'real' DB
>> put ijwAPLIB_getAllChecksums() into tList
>> put number of lines of tList into tNumLines
>> sort tList
>> put 0 into x
>> repeat tNumLines times
>>    add 1 to x
>>    -- removing this speeds it up by roughly 10%
>>    if last char of x is 1 then set the cursor to busy
>>    put line x of tList into tCheck
>>    if tCheck is empty then next repeat
>>    put x + 1 into y
>>    repeat (tNumLines - x) times
>>       put line y of tList into tOther
>>       if tCheck is tOther then
>>          put x & tab & y & tab & tCheck & return after tRet
>>       else
>>          -- resume at line y: the outer loop adds 1 to x, so step back one
>>          put y - 1 into x
>>          exit repeat
>>       end if
>>       add 1 to y
>>    end repeat
>> end repeat
>> put the milliseconds - tt & return & "number of files:" && \
>>       tNumLines & return & return & tRet
>>
>> ====================
>>
>> Sample results:
>>
>> 9804
>> number of files: 8708
>>
>> 116 117 027351c1bed597af774536af8e982363
>> 119 120 0292d175c04d790f50246a5ee043a599
>> 162 163 03d6313ee21a91ed0b0343f339c583e4
>> 185 186 046ddab379a8f44955f1d5605c294605
>> 230 231 05a77db5e76eb02f8d439e13286d3620
>> 245 246 065474aa9bba7e2f24c7435863f5f2ff
>> 314 315 0884f4b24b5bd99ddefdb100fde58a31
>> 333 334 0918ce2135933d6c8f0ee2860837b5f9
>> 360 361 0a2525bef1a46a329b7e902981ef94e2
>> 360 362 0a2525bef1a46a329b7e902981ef94e2
>> 360 363 0a2525bef1a46a329b7e902981ef94e2
>> 360 364 0a2525bef1a46a329b7e902981ef94e2
>>
>> Ian
>
----------------------------------------------------------------
Plugins and tutorials for Revolution: http://www.sosmartsoftware.com/
Email: eric.chatonet at sosmartsoftware.com
----------------------------------------------------------------