Processing Big-ish Data

Richard Gaskin ambassador at fourthworld.com
Thu Sep 12 13:15:06 EDT 2013


Geoff Canyon wrote:

 > I opened my mouth at work when I shouldn't, and now I'm writing a
 > function to process server log files: multi-gigabytes of data, and
 > tens of millions of rows of data. Speed optimization will be key...

What sort of processing are you doing on those logs? What are you
looking for that you can't get with Google Analytics?  And how big is
the resulting data, and where does it go: to a DB, another text file,
or a pipe to something else?

I had a similar task a while back, and wrote a command that reads in 
chunks according to a specified buffer size, parsing the buffer by a 
specified delimiter.  It dispatches callbacks for each element so I 
could use it as a sort of ersatz MapReduce, keeping the element parsing 
separate from the processing, allowing me to use it in different contexts
without having to rewrite the buffering stuff each time.
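
To give a rough idea of the shape of it, here's a minimal sketch of that
kind of buffered reader.  This isn't my actual command; the names
"readChunked" and "pCallback" are just illustrative, and error handling
is omitted:

  command readChunked pFilePath, pBufferSize, pDelimiter, pCallback, pTarget
     local tBuffer, tCount, i
     open file pFilePath for binary read
     set the itemDelimiter to pDelimiter
     repeat forever
        read from file pFilePath for pBufferSize
        if it is empty then exit repeat
        put it after tBuffer
        -- hand off every complete element; keep any trailing partial
        -- element in the buffer for the next pass
        put the number of items of tBuffer into tCount
        repeat with i = 1 to tCount - 1
           dispatch pCallback to pTarget with item i of tBuffer
        end repeat
        put item tCount of tBuffer into tBuffer
     end repeat
     -- whatever remains after the last read is the final element
     if tBuffer is not empty then
        dispatch pCallback to pTarget with tBuffer
     end if
     close file pFilePath
  end readChunked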

While dispatch is measurably a little faster than send, it still eats 
more time than processing in-line.  For my needs it was a reasonably 
efficient trade-off: in a test case where the processing callback merely 
obtains the second item from the element passed to it and appends it to 
a list, with a read buffer size of just 128k it churns through an 845 MB
file containing 758,721 elements in under 10 seconds.
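
The test callback in that benchmark amounted to little more than
something like this (handler and variable names here are illustrative,
and I'm assuming the fields within each element are tab-delimited):

  local sCollected

  on processElement pElement
     set the itemDelimiter to tab
     put item 2 of pElement & cr after sCollected
  end processElement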

But on a collection as large as your files, the benefits of a 
generalized approach using callbacks may be lost by the time consumed by 
dispatch.

Interestingly, I seem to have stumbled across a bit of a 
counter-intuitive relationship between buffer size and performance.  I 
had expected that using the largest-possible buffer size would always be 
faster since it's making fewer disk accesses.  But apparently the
overhead within LC of allocating larger blocks of memory offsets some
of that gain, with the following results:

Buffer size            Total time
--------------------   --------------
2097152 bytes  (2MB)   10.444 seconds
1048576 bytes  (1MB)   10.284 seconds
 524288 bytes (512k)   10.256 seconds
 262144 bytes (256k)    9.384 seconds
 131072 bytes (128k)    9.274 seconds
  65536 bytes  (64k)    9.312 seconds

These are inexact timings, but the trend is interesting if it proves
repeatable (these were one-off tests, and I haven't tried them on
multiple machines).
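
For anyone who wants to try reproducing this, one way to gather timings
like those above is a simple harness around the hypothetical readChunked
command sketched earlier:

  on testBufferSizes pFilePath
     local tStart, tSize, tReport
     repeat for each item tSize in "2097152,1048576,524288,262144,131072,65536"
        put the milliseconds into tStart
        readChunked pFilePath, tSize, cr, "processElement", me
        put tSize && "bytes:" && (the milliseconds - tStart) / 1000 \
              && "seconds" & cr after tReport
     end repeat
     put tReport
  end testBufferSizes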

Given your background, you've probably already decided to avoid using
"read until cr" for parsing, since of course that requires the engine to 
examine every character in the stream for the delimiter.
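
In other words, the difference between these two forms of read, where
the second avoids any per-character scanning during the read itself:

  read from file tPath until cr     -- engine scans for the delimiter
  read from file tPath for 131072   -- grabs a fixed-size chunk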

But if you haven't yet done much benchmarking on this, reading as binary 
is often much faster than reading as text for similar reasons, since 
binary mode is a raw scrape from disk while text mode translates NULLs 
and line endings on the fly.  In a quickie test of my file read handler,
using text mode adds nearly 30% to the overall time.
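
The difference is just the mode used when opening the file, e.g.:

  open file tPath for text read     -- line endings translated on the way in
  open file tPath for binary read   -- raw bytes, no translation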

A much smaller benefit is reading in chunk sizes that are a multiple of 
the file system's block size.  On HFS+, NTFS, and EXT4 the default is 4k;
many DB engines use multiples of 4k for their internal blocks for this 
reason, aligning them with the host file I/O.  While the speed 
difference in aligning to the file system block size from a scripting 
language is minimal, with a collection as large as yours it may add up.
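
In practice that just means picking a buffer size that's a multiple of
4096, e.g. for the 128k buffer used above:

  put 32 * 4096 into tBufferSize   -- 131072 bytes, 32 file-system blocks
  readChunked tPath, tBufferSize, cr, "processElement", me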

--
  Richard Gaskin
  Fourth World
  LiveCode training and consulting: http://www.fourthworld.com
  Webzine for LiveCode developers: http://www.LiveCodeJournal.com
  Follow me on Twitter:  http://twitter.com/FourthWorldSys




