how to compare 2 very large textfiles

Matthias Rebbe matthias_livecode_150811 at m-r-d.de
Thu Oct 6 17:50:39 EDT 2011


Hi all,

Let me explain what I am doing.


I am working on a tool that checks whether two folders/hard drives are in sync.
I create a detailed file list, including all subfolders, for each folder/drive. These two lists are then
compared.
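
A minimal sketch of how such a list could be built, not necessarily how the tool described here does it. The helper name listFolderDetailed is hypothetical. It assumes every folder is readable (a folder that cannot be entered is not handled), and it writes the modification time as seconds rather than as a formatted date.

```livecode
-- Sketch only: listFolderDetailed is a hypothetical helper name.
-- Returns one line per file: full path, size in bytes and modification
-- time in seconds, separated by tabs.
function listFolderDetailed pFolder
   local tOldFolder, tFileInfo, tSubFolders, tSub, tList
   if there is no folder pFolder then return empty
   put the defaultFolder into tOldFolder
   set the defaultFolder to pFolder
   -- "the detailed files" gives comma-separated items per file:
   -- item 1 = URL-encoded name, item 2 = size, item 5 = modification date
   repeat for each line tFileInfo in the detailed files
      put pFolder & "/" & URLDecode(item 1 of tFileInfo) & tab & \
            item 2 of tFileInfo & tab & item 5 of tFileInfo & return after tList
   end repeat
   put the folders into tSubFolders
   set the defaultFolder to tOldFolder
   -- descend into each subfolder, skipping the "." and ".." entries
   repeat for each line tSub in tSubFolders
      if tSub is "." or tSub is ".." then next repeat
      put listFolderDetailed(pFolder & "/" & tSub) after tList
   end repeat
   return tList
end function
```

Very deep folder trees may run into the recursion limit, so an iterative version with a list of pending folders might be the safer choice for a whole drive.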

For my tests I created a stack with two DataGrids. I read in the complete detailed file/folder structure of my hard disk drive (with filename, size and modification date/time, tab-separated) and put this list into both DataGrids.
A line looks like this, for example:
 
/Xcode 4.3/Applications/Audio/AU Lab Documentation/AULabHelp/AddingBuses.html	5284	17.08.10 18:49
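
A short sketch of how one such line could be taken apart again, assuming the three fields are always separated by single tabs and that no path contains a tab. The function name parseListLine and the array keys are made up for illustration.

```livecode
-- Sketch: split one tab-separated list line into its three fields.
-- parseListLine is a hypothetical helper name.
function parseListLine pLine
   local tInfo
   -- the itemDelimiter is local to this handler
   set the itemDelimiter to tab
   put item 1 of pLine into tInfo["path"]
   put item 2 of pLine into tInfo["size"]
   put item 3 of pLine into tInfo["modified"]
   return tInfo
end function
```

Comparing whole lines, as the scripts in this message do, reports a file as missing on both sides when only its size or date differs. Splitting the line would allow matching on the path alone.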

In one of the DataGrids I changed some lines, just to get some "missing/wrong" lines.

The file/folder list had 173910 lines. With my script I still had no result after 30 minutes.
With Pete's script I get a result after about 10 seconds, and that includes both repeat loops.
All wrong/missing lines are listed.
Is such a performance improvement really possible?
Maybe this is a cache thing?
I also tried it with my iTunes folder, which has 4179 files in it.
My script needs about 6 seconds to finish, Pete's script less than 4 seconds.

But anyway, the array solution is definitely faster.
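
Timings like these can be taken with "the milliseconds". The following is only a sketch: compareLists is a hypothetical handler name, assumed to contain just the comparison loops and no "answer" dialog, so that the time spent waiting for the user is not measured.

```livecode
on mouseUp
   local tStart, tElapsed
   put the milliseconds into tStart
   -- compareLists is a hypothetical handler holding the loops to be timed
   compareLists
   put the milliseconds - tStart into tElapsed
   answer "Comparison took" && tElapsed && "ms"
end mouseUp
```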

Here are the scripts which I used for testing.

My script

put the dgtext of grp "Festplatte 1" into tHDD1
put the dgtext of grp "Festplatte 2" into tHDD2

REPEAT FOR each line i in tHDD1
      -- a line of tHDD1 that is not found in tHDD2 is missing in list 2
      IF  i is not among the lines of tHDD2 THEN put i & return after tMissingInHDD2
END REPEAT
answer the number of lines of tHDD1 & return & tMissingInHDD2


Pete's script (slightly adjusted)

put the dgtext of grp "Festplatte 1" into tHDD1
put the dgtext of grp "Festplatte 2" into tHDD2
      
REPEAT FOR each line i in tHDD1
     put true into myArray[i]["A"]
END REPEAT
   
REPEAT FOR each line i in tHDD2
      put true into myArray[i]["B"]
END REPEAT
   
REPEAT FOR each line k in the keys of myArray
      IF myArray[k]["A"] is not true THEN put k & return  after tMissingInHDD1
      IF myArray[k]["B"] is not true THEN put k & return  after tMissingInHDD2
END REPEAT
answer the number of lines of tHDD1 &return&tMissingInHDD1 &return&tMissingInHDD2
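
The speed difference is plausible: "is among the lines of" scans the second list once for every line of the first list, while an array lookup goes straight to the key. A possible variant with a flat array is sketched below. The function name linesMissingFrom is hypothetical, and empty lines are skipped by assumption.

```livecode
-- Sketch: flat-array variant of the comparison.
-- linesMissingFrom is a hypothetical helper name.
-- Returns the lines of pListA that do not occur in pListB.
function linesMissingFrom pListA, pListB
   local tSeenInB, tLine, tMissing
   -- index every non-empty line of list B as an array key
   repeat for each line tLine in pListB
      if tLine is empty then next repeat
      put true into tSeenInB[tLine]
   end repeat
   -- one keyed lookup per line of list A
   repeat for each line tLine in pListA
      if tLine is empty then next repeat
      if tSeenInB[tLine] is not true then put tLine & return after tMissing
   end repeat
   return tMissing
end function
```

Called as linesMissingFrom(tHDD1, tHDD2) and then linesMissingFrom(tHDD2, tHDD1), this gives both result lists. Unlike a loop over "the keys", whose order is not guaranteed, the output keeps the order of the input list.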
   

Regards,

Matthias





_____________________________________
Matthias Rebbe
Bramkampsieke 13
D-32312 Lübbecke
 
Tel     +49 57 41 - 31 00 00
mobil +49  160 - 550 44 62
Fax    +49 57 41 - 310 0 02
 
E-Mail matthias at matthiasrebbe.eu
http://www.matthiasrebbe.eu

On 06.10.2011 at 21:32, Pete wrote:

> Glad it worked Matthias.  Could you give us an idea of the new timing using
> the arrays?
> Pete
> Molly's Revenge <http://www.mollysrevenge.com>
> 
> 
> 
> 
> On Thu, Oct 6, 2011 at 12:17 PM, Matthias Rebbe <
> matthias_livecode_150811 at m-r-d.de> wrote:
> 
>> Hi Pete,
>> 
>> Thank you very much. It's so much faster.
>> 
>> It seems I should take a closer look at arrays.
>> 
>> 
>> Regards,
>> 
>> Matthias
>> On 06.10.2011 at 01:13, Pete wrote:
>> 
>>> I've used an array to do this type of operation in the past.  Haven't tried
>>> this code but it might work better.
>>> 
>>> repeat for each line i in tTextA
>>> put true into myArray[i]["A"]
>>> end repeat
>>> 
>>> repeat for each line i in tTextB
>>> put true into myArray[i]["B"]
>>> end repeat
>>> 
>>> repeat for each line k in the keys of myArray
>>> if myArray[k]["A"] is not true then put k & return after tMissingInA
>>> if myArray[k]["B"] is not true then put k & return after tMissingInB
>>> end repeat
>>> 
>>> Pete
>>> Molly's Revenge <http://www.mollysrevenge.com>
>>> 
>>> 
>>> 
>>> 
>>> On Wed, Oct 5, 2011 at 3:00 PM, Matthias Rebbe <
>>> matthias_livecode_150811 at m-r-d.de> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> I need to compare two very large text files with about 5000 - 7000 lines
>>>> each, with a line length of up to 256 chars.
>>>> 
>>>> I need to find out if there are lines missing in either file a or file b.
>>>> 
>>>> What is the best way to do this with good speed?
>>>> 
>>>> I tried checking, for each line in file a, whether the line is in file b.
>>>> After that, I check each line in file b to find out
>>>> whether the line is in file a.
>>>> 
>>>> With large files it takes about 10 to 15 minutes to do the complete check.
>>>> 
>>>> My script looks like this
>>>> 
>>>> repeat for each line i in tTextA
>>>> if i is not among the lines of tTextB then put i & return after tMissingInB
>>>> end repeat
>>>> 
>>>> repeat for each line i in tTextB
>>>> if i is not among the lines of tTextA then put i & return after tMissingInA
>>>> end repeat
>>>> 
>>>> Is there a better (faster) way?
>>>> 
>>>> Regards,
>>>> 
>>>> Matthias
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> _______________________________________________
>>>> use-livecode mailing list
>>>> use-livecode at lists.runrev.com
>>>> Please visit this url to subscribe, unsubscribe and manage your
>>>> subscription preferences:
>>>> http://lists.runrev.com/mailman/listinfo/use-livecode
>>>> 
>>>> 




