Problem parsing data in Gigabyte size text files

Dave dave at looktowindward.com
Fri Jul 6 05:41:44 EDT 2007


Hi,

That sounds like a better approach to me too. However, if the problem
is that the file is larger than 2 GB (or whatever the limit is on
Windows), then it still won't work.

All the Best
Dave
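
For reference, the character offset reported in the original post (quoted
below) can be compared against the usual 32-bit limits. The sketch below
uses Python purely as a calculator; it is not part of the Revolution stack,
and the function name describe_offset is made up for illustration. It shows
that 3,758,096,384 is exactly 3.5 GiB (0xE0000000), which is above the
signed 32-bit maximum but below the unsigned one. This only places the
number; it does not establish the cause of the crash, and the original post
notes that repeated runs stopped at slightly different characters.

```python
# Offset taken from the log described in the original post.
FAILING_OFFSET = 3_758_096_384

GIB = 2**30                   # bytes in one gibibyte
SIGNED_32_MAX = 2**31 - 1     # largest value of a signed 32-bit integer
UNSIGNED_32_MAX = 2**32 - 1   # largest value of an unsigned 32-bit integer


def describe_offset(offset):
    """Return basic facts about a file offset relative to 32-bit limits."""
    return {
        "gib": offset / GIB,
        "hex": hex(offset),
        "exceeds_signed_32": offset > SIGNED_32_MAX,
        "exceeds_unsigned_32": offset > UNSIGNED_32_MAX,
    }


if __name__ == "__main__":
    for key, value in describe_offset(FAILING_OFFSET).items():
        print(key, value)
```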

On 5 Jul 2007, at 13:20, Andre Garzia wrote:

> Alejandro,
> if this is the kind of XML that has a simple record structure repeated
> over and over again, like a phone book, then why don't you break it
> into smaller, digestible chunks and insert it into something like SQLite
> or Valentina, chunk by chunk? By using an RDBMS you'll be able to query
> and make sense of the XML data easily, and those databases will have no
> problem dealing with large data sets.
>
> Even if you manage to load 8 GB of data into Rev, manipulating it will
> be kind of slow, I think. Just imagine the loops needed to make
> cross-references, like finding everyone who was born in July and is
> between 30 and 40 years old.
>
> I'd write a little program to go piece by piece, inserting this into a
> database, and then begin again from there.
>
> Andre
>
> On 7/4/07, Alejandro Tejada <capellan2000 at yahoo.com> wrote:
>>
>> Hi all,
>>
>> Recently, I was extracting data
>> from an 8-gigabyte ANSI text file
>> (an XML customer database), but after
>> processing approximately 3.5 gigabytes
>> of data, Revolution quit by itself and
>> Windows XP presented the familiar dialog
>> asking to notify the developer of this
>> error.
>>
>> The log file that I saved while using
>> the stack shows that after reading character
>> 3,758,096,384 (that is, more than 3 thousand million
>> characters) the stack could not read any further
>> into the XML database and started repeating the
>> last line of text that it had read.
>>
>> Note that I checked processor and memory use
>> with Windows Task Manager and everything was normal.
>> The stack was using between 30% and 70% of the processor,
>> and memory use was between 45 MB and 125 MB.
>>
>> The code used is similar to this:
>>
>> repeat until tCounter >= 8589934592 -- 8 gigabytes
>>   -- read the next 10,000 characters of the database, starting at
>>   -- tCounter; the characters read are placed in the variable "it"
>>   read from file tData at tCounter for 10000
>>   put processDATA(it) into tProcessedData
>>   write tProcessedData to file tNewFile
>>   put tCounter && last line of it & cr after URL tLOG
>>   add 10000 to tCounter
>> end repeat
>>
>> etc...
>>
>> I have repeated the test at least 3 times :((
>> and the results are almost the same, with a small
>> difference in the character at which the stack quits
>> while reading this 8-gigabyte XML database.
>>
>> I have checked for strange characters in that part of
>> the database, after splitting the file into many parts,
>> but have not found any.
>>
>> Any insight that you could provide to process
>> this database from start to end is more
>> than welcome. :)
>>
>> Thanks in advance.
>>
>> alejandro
>>
>>
>> Visit my site:
>> http://www.geocities.com/capellan2000/
>>
>>
>>
>> _______________________________________________
>> use-revolution mailing list
>> use-revolution at lists.runrev.com
>> Please visit this url to subscribe, unsubscribe and manage your
>> subscription preferences:
>> http://lists.runrev.com/mailman/listinfo/use-revolution
>>

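The chunk-by-chunk import suggested in the quoted message from Andre could
look roughly like the sketch below. It is written in Python rather than
Revolution so that it can be run and checked on its own, and it rests on
several assumptions that are not in the thread: the closing tag
</customer>, the table name records, the column raw, and the latin-1
decoding of the "ANSI" text are all hypothetical. It reads a fixed-size
chunk at a time, cuts out complete records, and inserts them into SQLite
in batches, so only one chunk plus any unfinished record is held in memory.

```python
import io
import sqlite3


def iter_records(stream, end_tag=b"</customer>", chunk_size=10_000):
    """Yield complete records from a binary stream, one at a time.

    end_tag is a hypothetical closing tag; the real file's tag is unknown.
    Bytes after the last end tag (an unfinished record) are ignored.
    """
    buffer = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Cut out every complete record currently in the buffer.
        while True:
            pos = buffer.find(end_tag)
            if pos == -1:
                break
            end = pos + len(end_tag)
            yield buffer[:end].strip()
            buffer = buffer[end:]


def load_records(conn, records, batch_size=1000):
    """Insert raw record text into SQLite in batches; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records (id INTEGER PRIMARY KEY, raw TEXT)"
    )
    total = 0
    batch = []
    for record in records:
        # latin-1 is an assumption standing in for the "ANSI" encoding.
        batch.append((record.decode("latin-1"),))
        if len(batch) >= batch_size:
            conn.executemany("INSERT INTO records (raw) VALUES (?)", batch)
            conn.commit()
            total += len(batch)
            batch = []
    if batch:
        conn.executemany("INSERT INTO records (raw) VALUES (?)", batch)
        conn.commit()
        total += len(batch)
    return total


if __name__ == "__main__":
    sample = (
        b"<customer><name>A</name></customer>\n"
        b"<customer><name>B</name></customer>\n"
    )
    conn = sqlite3.connect(":memory:")
    count = load_records(conn, iter_records(io.BytesIO(sample), chunk_size=16))
    print("records inserted:", count)
```

For a real file, the stream would come from open(path, "rb") and the
connection from sqlite3.connect on a file path. As Dave points out above,
whether any such loop gets past the failing offset depends on the tool
doing the reading, so this illustrates the approach and is not a
confirmed fix.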


