Another Revolution Success Story

Richard Gaskin ambassador at
Thu Jul 1 17:02:25 CDT 2004

Alex Tweedly wrote:

> At 00:25 01/07/2004 -0700, Richard Gaskin wrote:
>> My post from 14 June 2002 with my own CSV2Tab function is at
>> <>.
>> Hats off to anyone who can improve it's speed, and a bottle of
>> 12-year-old single malt to anyone who can come up with an algorithm I
>> can use which is at least twice as fast.
> Now there's a challenge I can relate to :-)
> BUT - the speed of the conversion depends on the data ...
> Enclosed below is a version of the script which is between 10% and 90%
> faster - and probably has potential to go even faster than that.
> It uses the same set up as you did - so there are no quotes left
> except those around fields.
> Then instead of walking through the data char by char, it use
> "split()" to divide into an array; the array elements must then
> alternate between in-quotes and not-in-quotes.
> Each array element has only the relevant processing applied.

Great stuff.  Using split is a ingenious way to reduce the load.

If CSV were consistently implemented CSV2TabNew would work excellently 
right out of the box, but since some CSVs escape quotes by doubling them 
I needed to add one line (see below) to also substitute doubled quote 
chars with the quote placeholder.

Bonus: since the added line reduces the number of quote characters, the 
function is now even faster.

I ran 1000 iterations of all three algorithms on a small test file which 
uses the Excel escaping format of doubled quotes (I believe it's MS 
Access that uses slash-quote to escape, if memory serves).  Average 
times on my machine (G4 PowerBook) are roughly:

CSV2Tab:    438ms
CSV2TabNew: 369ms
CSV2Tab3:   195ms

On a slightly more complex example the times were:

CSV2Tab:    892ms
CSV2TabNew: 333ms
CSV2Tab3:   153ms

Here's CSV2Tab3:

function CSV2Tab3 pData
   local tNuData -- contains tabbed copy of data
   local tReturnPlaceholder -- replaces cr in field data to avoid line
   --                       breaks which would be misread as records;
   --                       replaced later during dislay
   local tEscapedQuotePlaceholder -- used for keeping track of quotes
   --                       in data
   local tInQuotedText -- flag set while reading data between quotes
   put numtochar(11) into tReturnPlaceholder -- vertical tab as
   --                       placeholder
   put numtochar(2)  into tEscapedQuotePlaceholder -- used to simplify
   --                       distinction between quotes in data and those
   --                       used in delimiters
   -- Normalize line endings:
   replace crlf with cr in pData          -- Win to UNIX
   replace numtochar(13) with cr in pData -- Mac to UNIX
   -- Put placeholder in escaped quote (non-delimiter) chars:
   replace ("\"&quote) with tEscapedQuotePlaceholder in pData
   replace quote&quote with tEscapedQuotePlaceholder in pData --<NEW
   put space before pData   -- to avoid ambiguity of starting context
   split pData by quote
   put False into tInsideQuoted
   repeat for each element k in pData
     if (tInsideQuoted) then
       replace cr with tReturnPlaceholder in k
       put k after tNuData
       put False into tInsideQuoted
       replace comma with tab in k
       put k after tNuData
       put true into tInsideQuoted
     end if
   end repeat
   delete char 1 of tNuData -- remove the leading space
   replace tEscapedQuotePlaceholder with quote in tNuData
   return tNuData
end CSV2Tab3

Please send me a private email and we'll make arrangements for the 
scotch delivery.  Thanks for the assist.

  Richard Gaskin
  Fourth World Media Corporation
  Rev tools and more:

More information about the use-livecode mailing list