Mon May 7 19:30:08 EDT 2012

Some years ago, this list discussed the difficulties of parsing the
comma-separated-value (CSV) file format; Richard Gaskin has a great
article about it at
http://www.fourthworld.com/embassy/articles/csv-must-die.html

Following that discussion, I came up with some code to parse CSV in 
Livecode which was significantly faster than the straightforward 
methods (quoted in the above article). At the time, I put that speed 
gain down to two factors:

1. a way of looking at the problem "sideways" that enables a different 
   approach
2. a 'clever' use of split + array access

Recently the topic came up again, and I looked at the code again; I now 
realize that in fact the speed gain came entirely from the first of 
those two factors, and using split + arrays was not helpful. Livecode's 
chunk handling is (in this case) faster than using arrays (my only 
excuse is that I was new to Livecode, and so I was using techniques I 
was familiar with from other languages). So I revised the code to use 
chunk handling rather than split+arrays, and the resulting code runs 
about 40% faster, with the added benefit of being slightly easier to 
read and understand.  The only slightly mind-bending feature of the new 
code is the use of

     set the lineDelimiter to quote
     repeat for each line k in pData ....

I find it hard to think about "lines" that aren't actually lines :-)
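
To make the idea concrete, here is a small illustrative sketch. The
handler name DescribeQuoteChunks is made up for this example and is not
part of the code below. It assumes well-formed input in which every
opening quote has a matching closing quote, so that the quote-delimited
"lines" simply alternate between text outside quotes and text inside
quotes.

> function DescribeQuoteChunks pData
>   -- hypothetical helper, for illustration only
>   local tReport, tInsideQuoted, k
>   put false into tInsideQuoted
>   set the lineDelimiter to quote
>   repeat for each line k in pData
>     if tInsideQuoted then
>       put "inside:  [" & k & "]" & cr after tReport
>     else
>       put "outside: [" & k & "]" & cr after tReport
>     end if
>     -- every quote we pass flips the state
>     put not tInsideQuoted into tInsideQuoted
>   end repeat
>   return tReport
> end DescribeQuoteChunks

Calling it as
     put DescribeQuoteChunks("a," & quote & "b,c" & quote & ",d")
should report three chunks: one outside the quotes, then the chunk
between the quotes (b,c), then another outside. Only the "outside"
chunks contain commas that are really field separators.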

So, for anyone who needs or wants more speed, here's the code:

> function CSV3Tab pData,pcoldelim
>   local tNuData -- contains tabbed copy of data
>   local tReturnPlaceholder -- replaces cr in field data to avoid line
>   --                       breaks which would be misread as records;
>   --                       replaced later during display
>   local tEscapedQuotePlaceholder -- used for keeping track of quotes
>   --                       in data
>   local tInsideQuoted -- flag set while reading data between quotes
>   local k -- the current quote-delimited chunk of pData
>   --
>   put numtochar(11) into tReturnPlaceholder -- vertical tab as
>   --                       placeholder
>   put numtochar(2)  into tEscapedQuotePlaceholder -- used to simplify
>   --                       distinction between quotes in data and those
>   --                       used in delimiters
>   --
>   if pcoldelim is empty then put comma into pcoldelim
>   -- Normalize line endings:
>   replace crlf with cr in pData          -- Win to UNIX
>   replace numtochar(13) with cr in pData -- Mac to UNIX
>   --
>   -- Put placeholder in escaped quote (non-delimiter) chars:
>   replace ("\"&quote) with tEscapedQuotePlaceholder in pData
>   replace quote&quote with tEscapedQuotePlaceholder in pData
>   --
>   put space before pData   -- to avoid ambiguity of starting context
>   put False into tInsideQuoted
>   set the linedel to quote
>   repeat for each line k in pData
>     if (tInsideQuoted) then
>       replace cr with tReturnPlaceholder in k
>       put k after tNuData
>       put False into tInsideQuoted
>     else
>       replace pcoldelim with numtochar(29) in k
>       put k after tNuData
>       put true into tInsideQuoted
>     end if
>   end repeat
>   --
>   delete char 1 of tNuData -- remove the leading space
>   replace numtochar(29) with tab in tNuData -- field separators to tabs
>   replace tEscapedQuotePlaceholder with quote in tNuData
>   return tNuData
> end CSV3Tab
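
A possible way to call the function is sketched here. The mouseUp
handler, the sample data and the use of the message box are assumptions
made for illustration. The handler converts any numtochar(29) field
marks to tabs itself, which does nothing if the data it gets back is
already tab-delimited, and it restores the vertical-tab placeholder to
a real line break only when a field is displayed.

> on mouseUp
>   local tCSV, tTabbed, tRecord, tField
>   -- sample data: the quoted field in record 2 contains a comma
>   put "name,notes" & cr & "Alex," & quote & "fast, simple" & quote \
>         into tCSV
>   put CSV3Tab(tCSV, comma) into tTabbed
>   -- make sure field boundaries are tabs (no-op if they already are)
>   replace numtochar(29) with tab in tTabbed
>   set the itemDelimiter to tab
>   repeat for each line tRecord in tTabbed
>     repeat for each item tField in tRecord
>       -- restore line breaks that were inside quoted fields
>       replace numtochar(11) with cr in tField
>       put "[" & tField & "]" & cr after msg
>     end repeat
>   end repeat
> end mouseUp

Keeping the placeholder in place until display time means that each
record stays on one line, so "repeat for each line" can be used safely
on the converted data.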

-- Alex.

