how to compare 2 very large textfiles

Alex Tweedly alex at tweedly.net
Thu Oct 6 19:29:05 EDT 2011


On 06/10/2011 22:19, Pete wrote:
> Thanks for the report back on the speed Alex.  I guess its academic if the
> speed is down to 100msecs but I'm wondering if a binary search technique
> would be better or worse (assuming the lists were sorted of course).
>
> How did you create the two lists for your test?  I'd like to try the binary
> search but stuck with an easy way to generate two large files like that!
>
> Pete
> Molly's Revenge<http://www.mollysrevenge.com>
>


Pretty basic, I have to admit.

Fill a variable with a large number of characters (all lower case letters).
Insert a CR every 100-300 characters within it.
Then (later) copy  out however many lines you want to use ....

(and to make different lines to test the 'difference' part of it, I 
inserted uppercase 'A's at random places :-)

(see code below)..

btw - binary search won't help (or at least, not very much). To use 
binary search, you need the lines in order. If you have the lines in 
order, then you don't need binary search, you can use a simple 
step-by-step comparison (being very careful about lines which are 
duplicaed in each file !!).

And in any case, to get the lines in order you need to sort them, so 
that will be O(n log n) at best.  That makes it no better than, and 
perhaps not as good as, the array technique, since hash-table lookup is 
likely to be somewhere between constant and 'log n' in time, so the 
whole operation is somewhere between linear and O(n log n) anyway.


As an aside, in case you do ever want to use binary search (or the 
step-by-step comparison) remember you cannot do it efficiently using 
"line X of tData" because that requires scanning to count lines. You 
need to (rather painfully) use char offsets (since they can be accessed 
in constant time).

-- Alex.
> global gData
> on mouseUp
>    local ka, tStartTime, t
>
>    put empty into gData
>    put the millisecs into tStartTime
>    put chartonum("a")-1 into ka
>    constant K=5000000, K1=10000
>    repeat K times
>       put numtochar(random(26) + ka) after gData
>    end repeat
>    put 1 into t
>    repeat until t > K
>       add (100+random(200)) to t
>       put CR into char t of gData
>    end repeat
>    put the number of lines in gData && the millisecs-tStartTime & CR 
> into field "F"
>    put gData after field "F"
> end mouseUp





More information about the use-livecode mailing list