CSV again.

Peter Haworth pete at lcsql.com
Mon May 14 16:00:17 EDT 2012


I've just been checking out Alex's new csv parser and it is indeed much
faster than the original: the gain is closer to 50% than the 40% he
reported, at least in my test case.

However, I've also run into a LiveCode issue while doing all this.  It has
come up before in the context of what LC thinks is a line; there's a
similar issue/confusion/whatever with items.

Let's say you have a string "1,2,3,4,5,6" - LC thinks there are 6 items in
it, no problem.

Now change the string to "1,2,3,4,5,6," (note the trailing comma) - LC
still thinks there are 6 items in that string.

So to LC, "1,2,3,4,5,6" and "1,2,3,4,5,6," are equivalent in terms of the
number of items in them.  In the context of parsing csv files, they
definitely are not: the trailing comma marks a seventh, empty field.
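
For what it's worth, here is a minimal sketch of one way to get a
csv-style field count out of LC's item counting. csvFieldCount is a
hypothetical helper name of my own, not anything built into LC or part of
Alex's parser, and it assumes a single record with a one-character
delimiter and no quoted delimiters in the data.

```livecode
-- Hypothetical helper: count the csv fields in one record.
-- Assumes pRecord contains no quoted delimiters and pDelim is one char.
function csvFieldCount pRecord, pDelim
   local tCount
   if pDelim is empty then put comma into pDelim
   set the itemDelimiter to pDelim
   put the number of items in pRecord into tCount
   -- LC treats a trailing delimiter as a terminator rather than a
   -- separator, so add back the empty final field it did not count.
   if the last char of pRecord is pDelim then add 1 to tCount
   return tCount
end csvFieldCount
```

With this, csvFieldCount("1,2,3,4,5,6") should give 6 and
csvFieldCount("1,2,3,4,5,6,") should give 7.  An empty record still comes
back as 0, so whether that means "no fields" or "one empty field" would
need to be decided by the caller.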

Pete
lcSQL Software <http://www.lcsql.com>



On Mon, May 7, 2012 at 4:30 PM, Alex Tweedly <alex at tweedly.net> wrote:

> Some years ago, this list discussed the difficulties of parsing
> comma-separated-value file format; Richard Gaskin has a great article about
> it at http://www.fourthworld.com/embassy/articles/csv-must-die.html
>
> Following that discussion, I came up with some code to parse CSV in
> Livecode which was significantly faster than the straightforward methods
> (quoted in the above article). At the time, I put that speed gain down to
> two factors:
>
> 1. a way of looking at the problem "sideways" that enables a different
> approach
> 2. a 'clever' use of split + array access
>
> Recently the topic came up again, and I looked at the code again; I now
> realize that in fact the speed gain came entirely from the first of those
> two factors, and using split + arrays was not helpful. Livecode's chunk
> handling is (in this case) faster than using arrays (my only excuse is that
> I was new to Livecode, and so I was using techniques I was familiar with
> from other languages). So I revised the code to use chunk handling rather
> than split+arrays, and the resulting code runs about 40% faster, with the
> added benefit of being slightly easier to read and understand.  The only
> slightly mind-bending feature of the new code is the use of
>
>    set the lineDelimiter to quote
>    repeat for each line k in pData ....
>
> I find it hard to think about "lines" that aren't actually lines :-)
>
> So - for anyone who needs or wants more speed, here's the code
>
> function CSV3Tab pData,pcoldelim
>   local tNuData -- contains tabbed copy of data
>   local tReturnPlaceholder -- replaces cr in field data to avoid line
>   --                       breaks which would be misread as records;
>   --                       replaced later during display
>   local tEscapedQuotePlaceholder -- used for keeping track of quotes
>   --                       in data
>   local tInQuotedText -- flag set while reading data between quotes
>   local tInsideQuoted, k
>   --
>   put numtochar(11) into tReturnPlaceholder -- vertical tab as
>   --                       placeholder
>   put numtochar(2)  into tEscapedQuotePlaceholder -- used to simplify
>   --                       distinction between quotes in data and those
>   --                       used in delimiters
>   --
>   if pcoldelim is empty then put comma into pcoldelim
>   -- Normalize line endings:
>   replace crlf with cr in pData          -- Win to UNIX
>   replace numtochar(13) with cr in pData -- Mac to UNIX
>   --
>   -- Put placeholder in escaped quote (non-delimiter) chars:
>   replace ("\"&quote) with tEscapedQuotePlaceholder in pData
>   replace quote&quote with tEscapedQuotePlaceholder in pData
>   --
>   put space before pData   -- to avoid ambiguity of starting context
>   put False into tInsideQuoted
>   set the linedel to quote
>   repeat for each line k in pData
>     if (tInsideQuoted) then
>       replace cr with tReturnPlaceholder in k
>       put k after tNuData
>       put False into tInsideQuoted
>     else
>       replace pcoldelim with numtochar(29) in k
>       put k after tNuData
>       put true into tInsideQuoted
>     end if
>   end repeat
>   --
>   delete char 1 of tNuData -- remove the leading space
>   replace tEscapedQuotePlaceholder with quote in tNuData
>   return tNuData
> end CSV3Tab
>
> -- Alex.