LC7 and 8 - Non responsive processing large text files

BNig bernd.niggemann at uni-wh.de
Fri Apr 15 14:49:40 EDT 2016


In 2009 I helped someone in the forum who wanted to index Wikipedia by
"title". It took him 3 days to complete an indexing operation on a 23 GB XML
file.
http://forums.livecode.com/phpBB2/viewtopic.php?f=9&t=3690

After some optimizations it was down to 30 minutes.
http://forums.livecode.com/phpBB2/viewtopic.php?f=9&t=3728


When Roland started this thread I "unearthed" the old code, adapted it
(mostly replacing char by byte) and tweaked it a bit.

I now use a subset of the German Wikipedia.

The file is 1.74 GB and contains 395,621 occurrences of
<Title>text of title</Title>. I extract those "Titles" together with the byte
offset at which each "record" starts in the original file.

I write that information to an index file, which comes to 23 MB.

I get a throughput of roughly 20,000 "records" or "hits" per second. It
takes 20 seconds to gather all 395,621 records including writing out to the
index file.
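
For readers who want to see the shape of such an indexer, here is a minimal sketch in Python. It is not the LiveCode script discussed in this post. The names index_titles and write_index are made up for illustration, and the sketch assumes a "record" starts at the opening <Title> tag and that one index line is "offset, tab, title". It reads the file in binary chunks and carries over any tag that straddles a chunk boundary.

```python
OPEN_TAG = b"<Title>"
CLOSE_TAG = b"</Title>"

def index_titles(stream, chunk_size=80_000):
    """Yield (byte_offset, title_bytes) for each <Title>...</Title> pair.

    stream is any binary file-like object. byte_offset is the absolute
    position of the opening tag (an assumption about where a "record" starts).
    """
    buf = b""
    base = 0  # absolute offset in the stream of buf[0]
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty read means end of file
            break
        buf += chunk
        pos = 0
        while True:
            start = buf.find(OPEN_TAG, pos)
            if start == -1:
                # Keep only a tail that could hold a partial opening tag.
                keep_from = max(pos, len(buf) - (len(OPEN_TAG) - 1))
                break
            end = buf.find(CLOSE_TAG, start + len(OPEN_TAG))
            if end == -1:
                # Title not complete yet: keep it and wait for more data.
                keep_from = start
                break
            yield base + start, buf[start + len(OPEN_TAG):end]
            pos = end + len(CLOSE_TAG)
        base += keep_from
        buf = buf[keep_from:]

def write_index(src, dst, chunk_size=80_000):
    """Write one 'offset<TAB>title' line per hit to dst; return the hit count."""
    count = 0
    for offset, title in index_titles(src, chunk_size):
        dst.write(b"%d\t%s\n" % (offset, title))
        count += 1
    return count
```

With real files this would be called as write_index(open("dump.xml", "rb"), open("titles.idx", "wb")), where both file names are placeholders. A title that is never closed makes the buffer grow until end of file, which a production version would want to guard against.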

I am using a SSD.

As Richard says, this needs a little tweaking. I found that in LC 8 RC1
roughly 80,000 bytes per file access gives the best performance on my system,
a mid-2010 MacBook Pro. In LC 6 it is about 1 MB per file access. (LC 6.7.10
is twice as fast, whereas LC 7.1.3 is about 30% slower.)
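
Finding the best read size is a matter of measuring on your own machine. The following Python sketch (again an analogue, not LiveCode; read_all, time_chunk_sizes and make_sample_file are hypothetical helper names) times a full binary pass over a file for several chunk sizes.

```python
import os
import tempfile
import time

def read_all(path, chunk_size):
    """Read the whole file in binary mode, chunk_size bytes per access.

    Returns the total number of bytes read.
    """
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty read: end of file
                break
            total += len(chunk)
    return total

def time_chunk_sizes(path, sizes):
    """Return {chunk_size: seconds} for one full pass per chunk size."""
    results = {}
    for size in sizes:
        t0 = time.perf_counter()
        read_all(path, size)
        results[size] = time.perf_counter() - t0
    return results

def make_sample_file(n_bytes):
    """Create a throwaway file of n_bytes for trying the timing out."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"x" * n_bytes)
    return path
```

Something like time_chunk_sizes(path, [8_000, 80_000, 1_000_000]) gives a first impression. The operating system caches recently read files, so repeated passes over the same file mostly measure per-call overhead rather than disk speed; the numbers only mean something relative to each other on the same machine.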

And every 1000 records, when writing data out, I throw in a "wait 0
milliseconds with messages".

I can even type in a field without problem while indexing is running. 
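
The same idea, giving control back to the event loop at regular intervals during a long job, can be sketched in Python with asyncio. This is only an analogy to "wait 0 milliseconds with messages", not a description of how LiveCode implements it; process_records, ui_tick and demo are hypothetical names.

```python
import asyncio

async def process_records(records, handle, yield_every=1000):
    """Handle every record, pausing briefly every yield_every records."""
    for n, record in enumerate(records, start=1):
        handle(record)
        if n % yield_every == 0:
            # Zero-length sleep: lets other pending tasks run, then resumes.
            await asyncio.sleep(0)

async def ui_tick(log, ticks=3):
    """Stand-in for user-interface work that needs regular turns."""
    for _ in range(ticks):
        log.append("ui")
        await asyncio.sleep(0)

async def demo(n_records=3000):
    """Run the long job and the stand-in UI task side by side."""
    log = []
    await asyncio.gather(
        process_records(range(n_records), log.append),
        ui_tick(log),
    )
    return log
```

Running asyncio.run(demo()) returns a log in which the "ui" entries are interleaved with the record numbers instead of all appearing after the last record. Without the periodic pause, the long loop would hold the event loop until it finished.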

All of this is done using "binary read"; a simple "read" more than doubles
the time needed. Of course, whether binary read is OK for you depends on your
data.


So definitely one can process huge data files in LC without problem if one
adapts the code to the problem.

Doing this I discovered that LC 8 does not return "EOF" in the result when
attempting to read past the end of the file.
I reported the bug
http://quality.livecode.com/show_bug.cgi?id=17413

Reported: 2016-04-15 09:35 BST
Merged:   2016-04-15 11:38 BST

This must be one of the fastest bug-fixes on record, 2 hours from reporting
to "awaiting merge".

Hats off to Mark Waddingham and the team.

It will be fixed in LC 8 RC2.

Kind regards
Bernd
