how to compare 2 very large textfiles
alex at tweedly.net
Thu Oct 6 19:29:05 EDT 2011
On 06/10/2011 22:19, Pete wrote:
> Thanks for the report back on the speed Alex. I guess its academic if the
> speed is down to 100msecs but I'm wondering if a binary search technique
> would be better or worse (assuming the lists were sorted of course).
> How did you create the two lists for your test? I'd like to try the binary
> search but stuck with an easy way to generate two large files like that!
> Molly's Revenge<http://www.mollysrevenge.com>
Pretty basic, I have to admit.
Fill a variable with a large number of characters (all lower case letters).
Insert a CR every 100-300 characters within it.
Then (later) copy out however many lines you want to use ....
(and to make different lines to test the 'difference' part of it, I
inserted uppercase 'A's at random places :-)
(see code below)..
btw - binary search won't help (or at least, not very much). To use
binary search, you need the lines in order. If you have the lines in
order, then you don't need binary search, you can use a simple
step-by-step comparison (being very careful about lines which are
duplicaed in each file !!).
And in any case, to get the lines in order you need to sort them, so
that will be O(n log n) at best. That makes it no better than, and
perhaps not as good as, the array technique, since hash-table lookup is
likely to be somewhere between constant and 'log n' in time, so the
whole operation is somewhere between linear and O(n log n) anyway.
As an aside, in case you do ever want to use binary search (or the
step-by-step comparison) remember you cannot do it efficiently using
"line X of tData" because that requires scanning to count lines. You
need to (rather painfully) use char offsets (since they can be accessed
in constant time).
> global gData
> on mouseUp
> local ka, tStartTime, t
> put empty into gData
> put the millisecs into tStartTime
> put chartonum("a")-1 into ka
> constant K=5000000, K1=10000
> repeat K times
> put numtochar(random(26) + ka) after gData
> end repeat
> put 1 into t
> repeat until t > K
> add (100+random(200)) to t
> put CR into char t of gData
> end repeat
> put the number of lines in gData && the millisecs-tStartTime & CR
> into field "F"
> put gData after field "F"
> end mouseUp
More information about the Use-livecode