CSV again.
Alex Tweedly
alex at tweedly.net
Mon May 7 19:30:08 EDT 2012
Some years ago, this list discussed the difficulties of parsing the
comma-separated-value (CSV) file format; Richard Gaskin has a great article
about it at http://www.fourthworld.com/embassy/articles/csv-must-die.html
Following that discussion, I came up with some code to parse CSV in
Livecode which was significantly faster than the straightforward
methods (quoted in the above article). At the time, I put that speed
gain down to two factors:
1. a way of looking at the problem "sideways" that enables a different
approach
2. a 'clever' use of split + array access
Recently the topic came up again, and I looked at the code again; I now
realize that in fact the speed gain came entirely from the first of
those two factors, and using split + arrays was not helpful. Livecode's
chunk handling is (in this case) faster than using arrays (my only
excuse is that I was new to Livecode, and so I was using techniques I
was familiar with from other languages). So I revised the code to use
chunk handling rather than split+arrays, and the resulting code runs
about 40% faster, with the added benefit of being slightly easier to
read and understand. The only slightly mind-bending feature of the new
code is the use of
set the lineDelimiter to quote
repeat for each line k in pData ....
I find it hard to think about "lines" that aren't actually lines :-)
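A small sketch of why this works, assuming a single record with one quoted field: once the lineDelimiter is the quote character, each "line" is simply the text between two quotes, so the segments alternate between outside-quotes and inside-quotes. The handler name showQuoteSegments and its variable names are made up for illustration.

```livecode
on showQuoteSegments
   local tData, tSegment, tCount, tReport
   -- sample record: a,"b,c",d
   put "a," & quote & "b,c" & quote & ",d" into tData
   set the lineDelimiter to quote
   repeat for each line tSegment in tData
      add 1 to tCount
      -- odd-numbered segments lie outside quotes, even-numbered ones inside
      put tCount & ": [" & tSegment & "]" & return after tReport
   end repeat
   answer tReport
end showQuoteSegments
```

For this sample the three segments are `a,` then `b,c` then `,d`, so only the first and third need their commas converted to the column separator.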
So - for anyone who needs or wants more speed, here's the code
> function CSV3Tab pData,pcoldelim
> local tNuData -- contains tabbed copy of data
> local tReturnPlaceholder -- replaces cr in field data to avoid line
> -- breaks which would be misread as records;
> -- replaced later during display
> local tEscapedQuotePlaceholder -- used for keeping track of quotes
> -- in data
> local tInQuotedText -- flag set while reading data between quotes
> local tInsideQuoted, k
> --
> put numtochar(11) into tReturnPlaceholder -- vertical tab as
> -- placeholder
> put numtochar(2) into tEscapedQuotePlaceholder -- used to simplify
> -- distinction between quotes in data and those
> -- used in delimiters
> --
> if pcoldelim is empty then put comma into pcoldelim
> -- Normalize line endings:
> replace crlf with cr in pData -- Win to UNIX
> replace numtochar(13) with cr in pData -- Mac to UNIX
> --
> -- Put placeholder in escaped quote (non-delimiter) chars:
> replace ("\" & quote) with tEscapedQuotePlaceholder in pData
> replace quote & quote with tEscapedQuotePlaceholder in pData
> --
> put space before pData -- to avoid ambiguity of starting context
> put False into tInsideQuoted
> set the linedel to quote
> repeat for each line k in pData
>    if (tInsideQuoted) then
>       replace cr with tReturnPlaceholder in k
>       put k after tNuData
>       put False into tInsideQuoted
>    else
>       replace pcoldelim with numtochar(29) in k
>       put k after tNuData
>       put true into tInsideQuoted
>    end if
> end repeat
> --
> delete char 1 of tNuData -- remove the leading space
> replace tEscapedQuotePlaceholder with quote in tNuData
> return tNuData
> end CSV3Tab
>
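A possible way to call the function, as a minimal sketch: the handler name demoCSV3Tab and the sample data are made up for illustration. It assumes the caller reads fields by setting the itemDelimiter to numtochar(29), the character the function writes between columns, and turns the numtochar(11) placeholder back into a line break for display.

```livecode
on demoCSV3Tab
   local tCSV, tOut, tField
   -- record 1: a,"b,c",d   record 2: e,"f<line break>g",h
   put "a," & quote & "b,c" & quote & ",d" & cr into tCSV
   put "e," & quote & "f" & cr & "g" & quote & ",h" after tCSV
   put CSV3Tab(tCSV, comma) into tOut
   -- columns come back separated by numtochar(29)
   set the itemDelimiter to numtochar(29)
   -- the quoted line break was replaced by numtochar(11), so the
   -- second record is still a single line of tOut
   put item 2 of line 2 of tOut into tField
   replace numtochar(11) with cr in tField -- restore the break for display
   answer "Records:" && the number of lines of tOut & cr & \
         "Record 1, field 2:" && item 2 of line 1 of tOut & cr & \
         "Record 2, field 2:" & cr & tField
end demoCSV3Tab
```

Because the lineDelimiter is a local property, setting it to quote inside CSV3Tab does not affect the calling handler, which still sees ordinary return-delimited lines.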
-- Alex.