how to compare 2 very large textfiles

Pete pete at mollysrevenge.com
Thu Oct 6 19:02:24 EDT 2011


Hi Scott,
Basically the array ends up with a two-level key structure.  The first level
is keyed by the contents of the line; the second level contains one or two
keys, A and B.  If the line is present in both list A and list B, then the
subkeys A and B will both contain true.  If the line is not present in list A,
there won't be a subkey A for that line; similarly, if it's not in list B,
there won't be a subkey B for that line.

The first two repeat loops set up that array structure.  The last repeat loop
checks whether subkeys A and B of each main key are not true (which is the
case if the subkey doesn't exist) and, if so, adds the main key (which is the
line contents) to the list of lines missing from the corresponding list.
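
In case a sketch helps, something along these lines would match that
description.  It is written from the explanation above rather than copied
from the script posted earlier in the thread, so treat it as an illustration
only: the function name missingLines, the array names tLines and tMissing,
and the decision to skip blank lines are all assumptions.

```livecode
-- Hypothetical helper: pTextA and pTextB hold the contents of the two files.
function missingLines pTextA, pTextB
   local tLines, tMissing, tLine

   -- First loop: record every line that occurs in list A
   repeat for each line tLine in pTextA
      if tLine is empty then next repeat -- assumption: blank lines are ignored
      put true into tLines[tLine]["A"]
   end repeat

   -- Second loop: record every line that occurs in list B
   repeat for each line tLine in pTextB
      if tLine is empty then next repeat
      put true into tLines[tLine]["B"]
   end repeat

   -- Third loop: a subkey that was never set reads as empty, which is not true
   repeat for each key tLine in tLines
      if tLines[tLine]["A"] is not true then put tLine & return after tMissing["A"]
      if tLines[tLine]["B"] is not true then put tLine & return after tMissing["B"]
   end repeat

   return tMissing
end missingLines
```

Called as missingLines(tTextA, tTextB), this would return an array whose "A"
element lists the lines missing from list A and whose "B" element lists the
lines missing from list B.  Each line is looked up by key instead of being
searched for in the other text, which is presumably where the speed
difference comes from.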

That was harder to explain than writing the code!

Pete
Molly's Revenge <http://www.mollysrevenge.com>




On Thu, Oct 6, 2011 at 3:28 PM, Scott Rossi <scott at tactilemedia.com> wrote:

> Pete, I meant to ask, how does your array solution work?  Where does the
> comparison take place?  I've long used arrays for storing data but not much
> beyond that.
>
> Thanks & Regards,
>
> Scott Rossi
> Creative Director
> Tactile Media, UX Design
>
>
>
> Recently, Pete wrote:
>
> > Thanks for the report back on the speed Alex.  I guess it's academic if the
> > speed is down to 100msecs but I'm wondering if a binary search technique
> > would be better or worse (assuming the lists were sorted of course).
> >
> > How did you create the two lists for your test?  I'd like to try the binary
> > search but I'm stuck for an easy way to generate two large files like that!
> >
> > Pete
> > Molly's Revenge <http://www.mollysrevenge.com>
> >
> >
> >
> >
> > On Thu, Oct 6, 2011 at 1:17 PM, Alex Tweedly <alex at tweedly.net> wrote:
> >
> >> Much faster.
> >>
> >> I tried the original script (with typo fixed) on 7000 lines of varying
> >> length between 100 and 300 chars - took about 2 minutes to run. The array
> >> version (again with typo fixed) took around 100 msec.
> >>
> >> -- Alex.
> >>
> >> On 06/10/2011 20:16, Scott Rossi wrote:
> >>
> >>> FWIW, I tried a quick test of Matthias's script using two fields with 5000
> >>> lines of 256 chars each.  I tried using "i is not among the lines of" and
> >>> "i is not in" with identical results.  Processing time was 1 min 6 secs in
> >>> both cases (Mac Intel Core2 Duo).  Perhaps the array option posted is faster.
> >>>
> >>> Regards,
> >>>
> >>> Scott Rossi
> >>> Creative Director
> >>> Tactile Media, UX Design
> >>>
> >>>
> >>>
> >>> Recently, Michael Kann wrote:
> >>>
> >>>> Matthias,
> >>>>
> >>>> Your script should take a few seconds at most. There must be something else
> >>>> going on to slow you down. If you want to post the script itself and a few
> >>>> lines of data perhaps someone can figure it out.
> >>>>
> >>>> Mike
> >>>>
> >>>> --- On Wed, 10/5/11, Matthias Rebbe
> >>>> <matthias_livecode_150811 at m-r-d.de> wrote:
> >>>>
> >>>> From: Matthias Rebbe <matthias_livecode_150811 at m-r-d.de>
> >>>> Subject: how to compare 2 very large textfiles
> >>>> To: "How to use LiveCode" <use-livecode at lists.runrev.com>
> >>>> Date: Wednesday, October 5, 2011, 5:00 PM
> >>>>
> >>>>
> >>>> Hi,
> >>>>
> >>>> I need to compare two very large text files with about 5000 - 7000 lines
> >>>> each, with a line size of up to 256 chars.
> >>>>
> >>>> I need to find out if there are lines missing in either file a or file b.
> >>>>
> >>>> What is the best way to do this with good speed?
> >>>>
> >>>> I tried checking, for each line in file a, whether the line is in file b.
> >>>> After that, I check each line in file b and try to find out
> >>>> if the line is in file a.
> >>>>
> >>>> With large files it takes about 10 to 15 minutes to do the complete check.
> >>>>
> >>>> My script looks like this
> >>>>
> >>>> repeat for each line i in tTextA
> >>>>    if i is not among the lines of tTextB then put i&return after tMissingInB
> >>>> end repeat
> >>>>
> >>>> repeat for each line i in tTextB
> >>>>    if i is not among the lines of tTextA then put i&retrurn after tMissingInA
> >>>> end repeat
> >>>>
> >>>> Is there a better (faster) way?
> >>>>
> >>>> Regards,
> >>>>
> >>>> Matthias
> >>>>
> >>>
> >>>
> >>> _______________________________________________
> >>> use-livecode mailing list
> >>> use-livecode at lists.runrev.com
> >>> Please visit this url to subscribe, unsubscribe and manage your
> >>> subscription preferences:
> >>> http://lists.runrev.com/mailman/listinfo/use-livecode
> >>>
> >>>


