Processing Big-ish Data
ambassador at fourthworld.com
Thu Sep 12 13:15:06 EDT 2013
Geoff Canyon wrote:
> I opened my mouth at work when I shouldn't, and now I'm writing a
> function to process server log files: multi-gigabytes of data, and
> tens of millions of rows of data. Speed optimization will be key...
What sort of processing are you doing on those logs? What are you
looking for that you can't get with Google Analytics? And how big is
the resulting data, and where does it go, to a DB or another text file
or piped to something else?
I had a similar task a while back, and wrote a command that reads in
chunks according to a specified buffer size, parsing the buffer by a
specified delimiter. It dispatches callbacks for each element so I
could use it as a sort of ersatz MapReduce, keeping the element parsing
separate from the processing, allowing me to use it in different contexts
without having to rewrite the buffering stuff each time.
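In Python terms, the approach looks roughly like this. This is only a sketch of the idea, not the actual LiveCode handler; the function and parameter names are my own:

```python
# Read a file in fixed-size binary chunks, split each chunk on a delimiter,
# and hand every complete element to a caller-supplied callback -- the
# "ersatz MapReduce" pattern: buffering stays separate from processing.
def process_file(path, callback, delimiter=b"\n", buffer_size=128 * 1024):
    remainder = b""  # partial element carried over from the previous chunk
    with open(path, "rb") as f:          # binary mode: raw read, no translation
        while True:
            chunk = f.read(buffer_size)
            if not chunk:
                break
            parts = (remainder + chunk).split(delimiter)
            remainder = parts.pop()      # last piece may be incomplete
            for element in parts:
                callback(element)
    if remainder:                        # trailing element with no delimiter
        callback(remainder)
```

The test case described below (grab the second item of each element and append it to a list) would then just be a two-line callback passed in, with no changes to the buffering code.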
While dispatch is measurably a little faster than send, it still eats
more time than processing in-line. For my needs it was a reasonably
efficient trade-off: in a test case where the processing callback merely
obtains the second item from the element passed to it and appends it to
a list, with a read buffer size of just 128k it churns through an 845 MB
file containing 758,721 elements in under 10 seconds.
But on a collection as large as your files, the benefits of a
generalized approach using callbacks may be lost to the time consumed by
the dispatch calls themselves.
Interestingly, I seem to have stumbled across a bit of a
counter-intuitive relationship between buffer size and performance. I
had expected that using the largest-possible buffer size would always be
faster since it's making fewer disk accesses. But apparently the
overhead within LC to allocate blocks of memory somewhat mitigates that,
with the following results:
Buffer size            Total time
2097152 bytes (2MB)    10.444 seconds
1048576 bytes (1MB)    10.284 seconds
 524288 bytes (512k)   10.256 seconds
 262144 bytes (256k)    9.384 seconds
 131072 bytes (128k)    9.274 seconds
  65536 bytes (64k)     9.312 seconds
These are inexact timings, but still the trend is interesting, if indeed
repeatable (these were one-off tests, and I haven't tested this on other
systems).
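If you want to check whether the trend holds on your own files, a rough benchmark along these lines (my sketch, not the original test script) will do it:

```python
# Time a full sequential read of a file at a given buffer size, counting
# newlines as a stand-in for "real" per-element work.
import time

def time_read(path, buffer_size):
    count = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(buffer_size)
            if not chunk:
                break
            count += chunk.count(b"\n")
    return time.perf_counter() - start, count

# Example sweep over the same sizes as the table above:
# for size in (2097152, 1048576, 524288, 262144, 131072, 65536):
#     elapsed, lines = time_read("server.log", size)
#     print(f"{size:>8} bytes: {elapsed:.3f} seconds")
```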
Given your background you've probably already decided to avoid using
"read until cr" for parsing, since of course that requires the engine to
examine every character in the stream for the delimiter.
But if you haven't yet done much benchmarking on this, reading as binary
is often much faster than reading as text for similar reasons, since
binary mode is a raw scrape from disk while text mode translates NULLs
and line endings on the fly. In a quick test of my file read handler,
using text mode added nearly 30% to the overall time.
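Python shows the same binary-vs-text gap, for the same reason: text mode has to decode bytes and translate line endings, while binary mode is a raw copy. (This only illustrates the principle; the 30% figure above was measured in LiveCode, and the Python ratio will differ.)

```python
# Compare a full sequential read of the same file in binary vs. text mode.
import time

def compare_modes(path, buffer_size=128 * 1024):
    start = time.perf_counter()
    with open(path, "rb") as f:          # raw bytes, no translation
        while f.read(buffer_size):
            pass
    binary_time = time.perf_counter() - start

    start = time.perf_counter()
    with open(path, "r") as f:           # decodes text, translates newlines
        while f.read(buffer_size):
            pass
    text_time = time.perf_counter() - start
    return binary_time, text_time
```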
A much smaller benefit is reading in chunk sizes that are a multiple of
the file system's block size. On HFS+, NTFS, and ext4 the default is 4k;
many DB engines use multiples of 4k for their internal blocks for this
reason, aligning them with the host file I/O. While the speed
difference in aligning to the file system block size from a scripting
language is minimal, with a collection as large as yours it may add up.
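One way to do that alignment from a scripting language (this helper is my own assumption, not something from the original post, and `os.statvfs` is Unix-only) is to ask the file system for its block size and round the buffer up to a multiple of it:

```python
import os

def aligned_buffer_size(directory, target=128 * 1024):
    # f_bsize is the file system's preferred I/O block size (Unix only).
    block = os.statvfs(directory).f_bsize
    # Round the target up to the nearest multiple of the block size;
    # never go below one block.
    return max(block, -(-target // block) * block)
```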
LiveCode training and consulting: http://www.fourthworld.com
Webzine for LiveCode developers: http://www.LiveCodeJournal.com
Follow me on Twitter: http://twitter.com/FourthWorldSys