how to compare 2 very large textfiles

Pete pete at mollysrevenge.com
Thu Oct 6 19:45:53 EDT 2011


Thanks, Alex.  I managed to cobble something together to get the test lists.


I did try the binary search approach and, as you predicted, it was way slower
than the array approach (still much faster than the original code Matthias
was using, though).  So I'm happy with the array technique now (sketched
below).  Someone posted a variation on my original code which might be
slightly faster.
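
For the archives, the array technique amounts to something like this (a
minimal sketch; the handler and variable names are mine, not code from the
thread):

function linesOnlyInFirst pDataA, pDataB
   local tSeen, tMissing, tLine
   -- index every line of the second file as an array key (hash lookup)
   repeat for each line tLine in pDataB
      put true into tSeen[tLine]
   end repeat
   -- any line of the first file that isn't a key is a difference
   repeat for each line tLine in pDataA
      if tLine is not among the keys of tSeen then
         put tLine & CR after tMissing
      end if
   end repeat
   return tMissing
end linesOnlyInFirst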


Pete
Molly's Revenge <http://www.mollysrevenge.com>




On Thu, Oct 6, 2011 at 4:29 PM, Alex Tweedly <alex at tweedly.net> wrote:

> On 06/10/2011 22:19, Pete wrote:
>
>> Thanks for the report back on the speed, Alex.  I guess it's academic if
>> the speed is down to 100 msecs, but I'm wondering if a binary search
>> technique would be better or worse (assuming the lists were sorted, of
>> course).
>>
>> How did you create the two lists for your test?  I'd like to try the
>> binary search, but I'm stuck on an easy way to generate two large files
>> like that!
>>
>> Pete
>> Molly's Revenge <http://www.mollysrevenge.com>
>>
>>
>
> Pretty basic, I have to admit.
>
> Fill a variable with a large number of characters (all lower case letters).
> Insert a CR every 100-300 characters within it.
> Then (later) copy out however many lines you want to use.
>
> (and to make different lines to test the 'difference' part of it, I
> inserted uppercase 'A's at random places :-)
>
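> A minimal sketch of that step (names are mine, not from the thread; the
> count of 20 insertions is arbitrary):
>
> function makeVariant pData
>    local tData
>    put pData into tData
>    -- insert an uppercase "A" at a few random positions so that some,
>    -- but not all, lines end up differing from the original
>    repeat 20 times
>       put "A" before char random(the number of chars in tData) of tData
>    end repeat
>    return tData
> end makeVariant
>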
> (See the code below.)
>
> btw - binary search won't help (or at least, not very much).  To use
> binary search, you need the lines in order.  If you have the lines in
> order, then you don't need binary search: you can use a simple
> step-by-step comparison (being very careful about lines which are
> duplicated in each file!).
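>
> A minimal sketch of that step-by-step comparison (names are mine; it
> returns the lines present only in the first sorted list, and it uses
> "line i of" for readability, even though, as noted below, a fast version
> would use char offsets):
>
> function sortedOnlyInFirst pDataA, pDataB
>    local i, j, nA, nB, tOnlyInA
>    put 1 into i
>    put 1 into j
>    put the number of lines in pDataA into nA
>    put the number of lines in pDataB into nB
>    repeat while i <= nA and j <= nB
>       if line i of pDataA = line j of pDataB then
>          add 1 to i   -- matched pair (handles duplicates one at a time)
>          add 1 to j
>       else if line i of pDataA < line j of pDataB then
>          -- A's line sorts before B's current line, so B can't contain it
>          put line i of pDataA & CR after tOnlyInA
>          add 1 to i
>       else
>          add 1 to j
>       end if
>    end repeat
>    -- anything left over in A has no possible match in B
>    repeat while i <= nA
>       put line i of pDataA & CR after tOnlyInA
>       add 1 to i
>    end repeat
>    return tOnlyInA
> end sortedOnlyInFirst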
>
> And in any case, to get the lines in order you need to sort them, and
> that will be O(n log n) at best.  That makes it no better than, and
> perhaps not as good as, the array technique: hash-table lookup is likely
> to be somewhere between constant and 'log n' time, so the whole
> array-based operation is somewhere between linear and O(n log n) anyway.
>
>
> As an aside, in case you do ever want to use binary search (or the
> step-by-step comparison), remember you cannot do it efficiently using
> "line X of tData", because evaluating that chunk expression requires
> scanning from the start of the string to count lines.  You need to
> (rather painfully) use char offsets, since chars can be accessed in
> constant time.
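>
> For example, a linear pass over the lines via char offsets might look
> like this (a sketch, not from the thread): offset()'s third argument
> skips the chars already consumed, so each call only pays for the length
> of one line.
>
> on scanLines tData
>    local tSkip, tPos, tLine
>    put 0 into tSkip
>    repeat forever
>       put offset(CR, tData, tSkip) into tPos
>       if tPos = 0 then
>          -- no more CRs; whatever remains is the last line
>          put char tSkip + 1 to -1 of tData into tLine
>          -- ...process tLine here too...
>          exit repeat
>       end if
>       put char tSkip + 1 to tSkip + tPos - 1 of tData into tLine
>       -- ...process tLine here...
>       add tPos to tSkip
>    end repeat
> end scanLines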
>
> -- Alex.
>
>> global gData
>> on mouseUp
>>   local ka, tStartTime, t
>>
>>   put empty into gData
>>   put the millisecs into tStartTime
>>   put chartonum("a") - 1 into ka
>>   constant K=5000000, K1=10000
>>   -- fill gData with K random lowercase letters
>>   repeat K times
>>      put numtochar(random(26) + ka) after gData
>>   end repeat
>>   -- then break it into lines of 100-300 characters
>>   put 1 into t
>>   repeat until t > K
>>      add (100 + random(200)) to t
>>      put CR into char t of gData
>>   end repeat
>>   put the number of lines in gData && the millisecs - tStartTime & CR \
>>         into field "F"
>>   put gData after field "F"
>> end mouseUp
>>
>
>


