how to compare 2 very large textfiles
Matthias Rebbe
matthias_livecode_150811 at m-r-d.de
Thu Oct 6 17:50:39 EDT 2011
Hi all,
let me explain what I am doing.
I am working on a tool that checks whether two folders or hard drives are in sync.
I create a detailed file list, including all subfolders, for each folder/drive. These two lists are then
compared.
For my tests I created a stack with 2 DataGrids. I read in the complete detailed file/folder structure of my hard disk drive (with filename, size and modification date/time, tab-separated) and put this list into both DataGrids.
A line looks like this, for example:
/Xcode 4.3/Applications/Audio/AU Lab Documentation/AULabHelp/AddingBuses.html 5284 17.08.10 18:49
In one of the DataGrids I changed some lines, just to get some "missing/wrong" lines.
The file/folder list had 173910 lines. My script had not produced a result even after 30 minutes.
With Pete's script I get a result after about 10 seconds, and that is for both repeat loops.
All wrong/missing lines are listed.
Is such a performance improvement really possible?
Or is this maybe a cache effect?
I also tried it with my iTunes folder, which has 4179 files in it.
My script needs about 6 seconds to finish, Pete's script less than 4 seconds.
Either way, the array solution is definitely faster.
Here are the scripts I used for testing.
My script
put the dgtext of grp "Festplatte 1" into tHDD1
put the dgtext of grp "Festplatte 2" into tHDD2
REPEAT FOR each line i in tHDD1
IF i is not among the lines of tHDD2 THEN put i & return after tMissingInHDD2
END REPEAT
answer the number of lines of tHDD1 &return&tMissingInHDD2
Pete's script (slightly adjusted)
put the dgtext of grp "Festplatte 1" into tHDD1
put the dgtext of grp "Festplatte 2" into tHDD2
REPEAT FOR each line i in tHDD1
put true into myArray[i]["A"]
END REPEAT
REPEAT FOR each line i in tHDD2
put true into myArray[i]["B"]
END REPEAT
REPEAT FOR each line k in the keys of myArray
IF myArray[k]["A"] is not true THEN put k & return after tMissingInHDD1
IF myArray[k]["B"] is not true THEN put k & return after tMissingInHDD2
END REPEAT
answer the number of lines of tHDD1 &return&tMissingInHDD1 &return&tMissingInHDD2
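
A likely explanation for the difference, offered as a sketch rather than a measured result: "is not among the lines of" has to search tHDD2 again for every single line of tHDD1, so the work grows roughly with the product of the two line counts, while an array key lookup costs about the same however many keys there are. The handler names linesMissingFrom and compareLists below are hypothetical and not from this thread; only the variable and group names are taken from the scripts above. Unlike looping over the keys of the array, this version reports lines in their original order and lists a duplicated line once per occurrence.

```livecode
-- Hypothetical helper: returns the lines of pSource that do not occur in pOther.
function linesMissingFrom pSource, pOther
   local tSeen, tMissing
   -- One pass over pOther: store every line as an array key.
   repeat for each line tLine in pOther
      if tLine is empty then next repeat
      put true into tSeen[tLine]
   end repeat
   -- One pass over pSource: each check is a key lookup, not a search of pOther.
   repeat for each line tLine in pSource
      if tLine is empty then next repeat
      if tSeen[tLine] is not true then put tLine & return after tMissing
   end repeat
   return tMissing
end linesMissingFrom

-- Hypothetical usage with the two DataGrids from the test stack,
-- timed with the milliseconds.
on compareLists
   local tHDD1, tHDD2, tStart, tMissingInHDD1, tMissingInHDD2
   put the dgtext of grp "Festplatte 1" into tHDD1
   put the dgtext of grp "Festplatte 2" into tHDD2
   put the milliseconds into tStart
   put linesMissingFrom(tHDD2, tHDD1) into tMissingInHDD1
   put linesMissingFrom(tHDD1, tHDD2) into tMissingInHDD2
   answer (the milliseconds - tStart) && "ms" & return & tMissingInHDD1 & return & tMissingInHDD2
end compareLists
```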
Regards,
Matthias
_____________________________________
Matthias Rebbe
Bramkampsieke 13
D-32312 Lübbecke
Tel +49 57 41 - 31 00 00
mobil +49 160 - 550 44 62
Fax +49 57 41 - 310 0 02
E-Mail matthias at matthiasrebbe.eu
http://www.matthiasrebbe.eu
On 06.10.2011 at 21:32, Pete wrote:
> Glad it worked Matthias. Could you give us an idea of the new timing using
> the arrays?
> Pete
> Molly's Revenge <http://www.mollysrevenge.com>
>
>
>
>
> On Thu, Oct 6, 2011 at 12:17 PM, Matthias Rebbe <
> matthias_livecode_150811 at m-r-d.de> wrote:
>
>> Hi Pete,
>>
>> thank you very much. It's so much faster.
>>
>> It seems I should take a closer look at arrays.
>>
>>
>> Regards,
>>
>> Matthias
>> On 06.10.2011 at 01:13, Pete wrote:
>>
>>> I've used an array to do this type of operation in the past. Haven't tried
>>> this code but it might work better.
>>>
>>> repeat for each line i in tTextA
>>> put true into myArray[i]["A"]
>>> end repeat
>>>
>>> repeat for each line i in tTextB
>>> put true into myArray[i]["B"]
>>> end repeat
>>>
>>> repeat for each line k in the keys of myArray
>>> if myArray[k]["A"] is not true then put k & return after tMissingInA
>>> if myArray[k]["B"] is not true then put k & return after tMissingInB
>>> end repeat
>>>
>>> Pete
>>> Molly's Revenge <http://www.mollysrevenge.com>
>>>
>>>
>>>
>>>
>>> On Wed, Oct 5, 2011 at 3:00 PM, Matthias Rebbe <
>>> matthias_livecode_150811 at m-r-d.de> wrote:
>>>
>>>> Hi,
>>>>
>>>> I need to compare two very large text files with about 5000 - 7000 lines
>>>> each and a line length of up to 256 chars.
>>>>
>>>> I need to find out if there are lines missing in either file a or file b.
>>>>
>>>> What is the best way to do this with good speed?
>>>>
>>>> I tried checking, for each line in file a, whether the line is in file b.
>>>> After that, I check for each line in file b whether the line is in
>>>> file a.
>>>>
>>>> With large files it takes about 10 to 15 minutes to do the complete check.
>>>>
>>>> My script looks like this
>>>>
>>>> repeat for each line i in tTextA
>>>> if i is not among the lines of tTextB then put i & return after tMissingInB
>>>> end repeat
>>>>
>>>> repeat for each line i in tTextB
>>>> if i is not among the lines of tTextA then put i & return after tMissingInA
>>>> end repeat
>>>>
>>>> Is there a better (faster) way?
>>>>
>>>> Regards,
>>>>
>>>> Matthias
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> use-livecode mailing list
>>>> use-livecode at lists.runrev.com
>>>> Please visit this url to subscribe, unsubscribe and manage your
>>>> subscription preferences:
>>>> http://lists.runrev.com/mailman/listinfo/use-livecode
>>>>
>>>>
>>
>>
>>
>>