how to compare 2 very large textfiles

Alex Tweedly alex at tweedly.net
Thu Oct 6 19:57:51 EDT 2011


On 07/10/2011 00:45, Pete wrote:
> Thanks Alex.  I managed to cobble something together to get the test lists.
>
>
> I did try the binary search approach and it was way slower than the array
> approach as you predicted (Still much faster than the original code Matthias
> was using though).  So I'm happy with the array technique now.  Someone
> posted a variation on my original code which might be slightly faster.
>

You got me interested .... :-)

So I tried the sort + compare version. It is slightly slower than the 
array technique up to around 10,000 lines, pretty much the same up to 
20,000 lines and then (sometimes) starts to edge ahead after that. I 
gave up trying at 40,000 lines :-)

But if the data had been sorted already, or had to be sorted for some 
other reason, then it would be roughly twice as quick as the array method.
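
For reference, the array method is not shown in this message. A minimal sketch of what such an approach might look like follows. It is not the code Pete or Matthias actually used: the handler name arrayCompare and its variable names are made up for illustration, and it assumes blank lines can be ignored.

```livecode
-- Hypothetical sketch of an array-based compare (not code from this thread).
-- Each distinct line becomes an array key, so duplicates collapse automatically.
on arrayCompare pA, pB, @inAnotB, @inBnotA
   local tSeenA, tSeenB, tLine
   put empty into inAnotB
   put empty into inBnotA

   -- index the lines of each list as array keys
   repeat for each line tLine in pA
      if tLine is empty then next repeat  -- assumption: skip blank lines
      put true into tSeenA[tLine]
   end repeat
   repeat for each line tLine in pB
      if tLine is empty then next repeat
      put true into tSeenB[tLine]
   end repeat

   -- a key of one array that is missing from the other is unique to that list
   repeat for each key tLine in tSeenA
      if tSeenB[tLine] is empty then put tLine & CR after inAnotB
   end repeat
   repeat for each key tLine in tSeenB
      if tSeenA[tLine] is empty then put tLine & CR after inBnotA
   end repeat
end arrayCompare
```

Unlike the sort + compare version, this does not need the data sorted, and the results come back in array-key order rather than sorted order.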

-- Alex.
> on way0 pA, pB, @inAnotB, @inBnotA
>    -- NB data sets must NOT be passed by ref because they are modified
>    local t1, tA, tB, LA, LB, tLastDup
>    put the millisecs into t1
>    sort lines of pA
>    sort lines of pB
>    put "way 0 sorting " && the millisecs - t1 & CR after field "F"
>
>    -- now start the compare
>    put 1 into tA
>    put 1 into tB
>    put empty into tLastDup
>    repeat forever
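>       -- use > rather than >= so that a one-character final line
>       -- with no trailing CR is not skipped
>       if tA > the number of chars in pA then
>          -- A is exhausted: whatever is left of B is only in B
>          put char tB to -1 of pB after inBnotA
>          exit repeat
>       end if
>       if tB > the number of chars in pB then
>          -- B is exhausted: whatever is left of A is only in A
>          put char tA to -1 of pA after inAnotB
>          exit repeat
>       end if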
>       put line 1 of (char tA to -1 of pA) into LA
>       put line 1 of (char tB to -1 of pB) into LB
>       switch
>          case LA = LB
>             put  LA into tLastDup
>             add (the number of chars in LA + 1) to tA
>             add (the number of chars in LB + 1) to tB
>             break
>
>          case LA < LB
>             if LA <> tLastDup then
>                put LA & CR after inAnotB
>                put empty into tLastDup
>             end if
>             add (the number of chars in LA + 1) to tA
>             break
>
>          case LA > LB
>             if LB <> tLastDup then
>                put LB & CR after inBnotA
>                put empty into tLastDup
>             end if
>             add (the number of chars in LB + 1) to tB
>             break
>
>       end switch
>
>    end repeat
>    put "way0" && the millisecs - t1 && the number of lines in inAnotB && the number of lines in inBnotA & CR after field "F"
>
> end way0
>
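
A small usage sketch for the handler above, with made-up test data. The mouseUp handler and the variable names tListA, tListB, tOnlyA and tOnlyB are assumptions for illustration; way0 also expects a field "F" on the current card for its timing output.

```livecode
-- Hypothetical caller for way0; the data and variable names are illustrative only.
on mouseUp
   local tListA, tListB, tOnlyA, tOnlyB
   put "cherry" & CR & "apple" & CR & "banana" into tListA
   put "date" & CR & "banana" & CR & "cherry" into tListB

   -- way0 appends to the two result variables, so start them empty
   put empty into tOnlyA
   put empty into tOnlyB

   -- the lists are passed by value (way0 sorts its own copies);
   -- the results come back through the two by-reference parameters
   way0 tListA, tListB, tOnlyA, tOnlyB

   answer "Only in A:" & CR & tOnlyA & CR & "Only in B:" & CR & tOnlyB
end mouseUp
```

With this data, "apple" should end up in tOnlyA and "date" in tOnlyB, since "banana" and "cherry" appear in both lists.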




