Problem parsing data in Gigabyte size text files

Andre Garzia andre at andregarzia.com
Thu Jul 5 08:20:06 EDT 2007


Alejandro,
if this is the kind of XML that has a simple record structure repeated
over and over, like a phone book, then why don't you break it into
smaller, edible chunks and insert it into something like SQLite or
Valentina, chunk by chunk? By using an RDBMS you'll be able to query and
make sense of the XML data easily, and those databases have no problem
dealing with large data sets.
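A minimal sketch of that setup, assuming the Database library with the
SQLite driver is available; the database file, table, and column names
here are only illustrative:

```
-- open (or create) an SQLite database file; tDBID holds the connection ID
put revOpenDatabase("sqlite", "customers.db", , , ) into tDBID

-- one table per record type (names are illustrative, adapt to your XML)
revExecuteSQL tDBID, "CREATE TABLE IF NOT EXISTS customer" & \
      " (id INTEGER PRIMARY KEY, name TEXT, birthdate TEXT)"
```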

Because even if you manage to load 8 GB of data into Rev, manipulating it
will be kind of slow, I think. Just imagine the loops needed to make cross
references like "find everyone who was born in July and is between 30 and
40 years old"...
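With the data in a database, that kind of cross reference becomes a single
query instead of a hand-rolled loop. A sketch, assuming a customer table
with a `birthdate` column stored in SQLite's date format (the names are
illustrative):

```
-- find everyone born in July, aged 30 to 40, in one query
put revDataFromQuery(tab, cr, tDBID, \
      "SELECT name FROM customer" & \
      " WHERE strftime('%m', birthdate) = '07'" & \
      " AND (strftime('%Y','now') - strftime('%Y', birthdate))" & \
      " BETWEEN 30 AND 40") into tMatches
```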

I'd write a little tool to insert this into a database piece by piece,
and then work from there.
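The piece-by-piece pass might look something like this sketch: read a
fixed-size chunk, pull the complete records out of it, and insert them
with a parameterised statement. Here `parseRecords` is a hypothetical
handler you'd write for your record structure; it should consume the
complete records in the buffer and leave any trailing partial record
behind for the next pass:

```
open file tBigFile for read
put empty into tBuffer
repeat forever
   read from file tBigFile for 100000 chars
   if it is empty then exit repeat  -- reached end of file
   put it after tBuffer
   -- parseRecords is hypothetical: returns one complete record per
   -- line and trims those records from tBuffer
   repeat for each line tRecord in parseRecords(tBuffer)
      put item 1 of tRecord into tName
      put item 2 of tRecord into tBirth
      revExecuteSQL tDBID, \
            "INSERT INTO customer (name, birthdate) VALUES (:1, :2)", \
            "tName", "tBirth"
   end repeat
end repeat
close file tBigFile
```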

Andre

On 7/4/07, Alejandro Tejada <capellan2000 at yahoo.com> wrote:
>
> Hi all,
>
> Recently, I was extracting data
> from an 8-gigabyte ANSI text file
> (an XML customer database), but after
> processing approximately 3.5 gigabytes
> of data, Revolution quits itself and
> Windows XP presents the familiar dialog
> asking to notify the developer of this
> error.
>
> The log file that I saved while using
> the stack shows that, after reading character
> 3,758,096,384 (that is, more than 3 thousand million
> characters), the stack could not read any further
> into the XML database and started repeating the
> same last line of text that it had read.
>
> Note that I checked the processor and memory use
> with Windows Task Manager and everything was normal.
> The stack was using between 30 and 70% of the processor,
> and memory use was between 45 MB and 125 MB.
>
> The code used is similar to this:
>
> put 1 into tCounter
> repeat until tCounter >= 8589934592 -- 8 gigabytes
>    read from file tData at tCounter for 10000 chars
>    -- reading 10,000 characters from the database;
>    -- the characters read are placed in the variable: it
>    put processDATA(it) into tProcessedData
>    write tProcessedData to file tNewFile
>    put tCounter && last line of it & cr after URL tLOG
>    add 10000 to tCounter
> end repeat
>
> etc...
>
> I have repeated the test at least 3 times :((
> and the results are almost the same, with only a small
> difference in the character at which the stack quits
> while reading this 8-gigabyte XML database.
>
> I have checked for strange characters in that part of
> the database, after splitting the file into many parts,
> but have not found any.
>
> Every insight that you could provide to process
> this database from start to end is more
> than welcome. :)
>
> Thanks in advance.
>
> alejandro
>
>
> Visit my site:
> http://www.geocities.com/capellan2000/
>
>
>
> _______________________________________________
> use-revolution mailing list
> use-revolution at lists.runrev.com
> Please visit this url to subscribe, unsubscribe and manage your
> subscription preferences:
> http://lists.runrev.com/mailman/listinfo/use-revolution
>


