how to compare 2 very large textfiles

Pete pete at mollysrevenge.com
Thu Oct 6 20:56:27 EDT 2011


Just when I thought I was safe!

I see what you are doing to avoid LC locating lines itself, nice.  The
search I had in mind was different and used the binary search method and a
recursive routine.  Unfortunately I didn't save the code but it went roughy
like this (the lists still have to be sorted):

function binarySearch pstring, plist
  put round((the number of lines in plist /2)) into x
  if x < 1 then return false  -- pstring is not in plist
  switch
    case line x of pList = pstring  -- pstring is in plist
      return true
      break
    case line x of plist > pstring  -- pstring must be before the line being
checked
      binarySearch pstring, line 1 to x-1 of plist
      break
    case line x of plist < pstring --pstring must be after the line being
checked
      binarySearch pstring, line x+1  to -1 of pstring
      break
  end switch
end binarySearch

binarySearch gets called in a repeat loop for each line in listA and again
for each line in listB  If it returns false, the line is missing.   Not sure
if this could make use of your method avoiding LC line counting, maybe if
the lines were fixed length?


Pete
Molly's Revenge <http://www.mollysrevenge.com>




On Thu, Oct 6, 2011 at 4:57 PM, Alex Tweedly <alex at tweedly.net> wrote:
>
>
> You got me interested .... :-)
>
> So I tried the sort + compare version. It is slightly slower than the array
> technique up to around 10,000 lines, pretty much the same up to 20,000 lines
> and then (sometimes) starts to edge ahead after that. I gave up trying at
> 40,000 lines :-)
>
> But if the data had been sorted already, or had to be sorted for some other
> reason, then it would be roughly twice as quick as the array method.
>
> -- Alex.
>
>> on way0 pA, pB, @inAnotB, @inBnotA
>>   -- NB data sets must NOT be passed by ref because they are modified
>>   local t1, tA, tB, LA, LB, tLastDup
>>   put the millisecs into t1
>>   sort lines of pA
>>   sort lines of pB
>>   put "way 0 sorting " && the millisecs - t1 & CR after field "F"
>>
>>   -- now start the compare
>>   put 1 into tA
>>   put 1 into tB
>>   put empty into tLastDup
>>   repeat forever
>>      if tA >= the number of chars in pA then
>>         put char tB to -1 of pB after inBnotA
>>         exit repeat
>>      end if
>>      if tB >= the number of chars in pB then
>>         put char tA to -1 of pA after inAnotB
>>         exit repeat
>>      end if
>>      put line 1 of (char tA to -1 of pA) into LA
>>      put line 1 of (char tB to -1 of pB) into LB
>>      switch
>>         case LA = LB
>>            put  LA into tLastDup
>>            add (the number of chars in LA + 1) to tA
>>            add (the number of chars in LB + 1) to tB
>>            break
>>
>>         case LA < LB
>>            if LA <> tLastDup then
>>               put LA & CR after inAnotB
>>               put empty into tLastDup
>>            end if
>>            add the number of chars in LA+1 to tA
>>            break
>>
>>         case LA > LB
>>            if LB <> tLastDup then
>>               put LB & CR after inBnotA
>>               put empty into tLastDup
>>            end if
>>            add the number of chars in LB+1 to tB
>>            break
>>
>>      end switch
>>
>>   end repeat
>>   put "way0" && the millisecs-t1 &&  the number of lines in inAnotB && the
>> number of lines in inBnotA &CR after field "F"
>>
>> end way0
>>
>>
>
> ______________________________**_________________
> use-livecode mailing list
> use-livecode at lists.runrev.com
> Please visit this url to subscribe, unsubscribe and manage your
> subscription preferences:
> http://lists.runrev.com/**mailman/listinfo/use-livecode<http://lists.runrev.com/mailman/listinfo/use-livecode>
>
>



More information about the use-livecode mailing list