Another Revolution Success Story

Alex Tweedly alex at tweedly.net
Thu Jul 1 11:06:34 EDT 2004


At 00:25 01/07/2004 -0700, Richard Gaskin wrote:

>I asked around on this some time ago, including quite a few programmers 
>far smarter than me.  The best algorithm we could come up with was one 
>which walks through the data char by char, keeping track of when it's in 
>field data and when it leaves the field, noting that commas are escaped
>inconsistently in MS products and not all fields have their data
>enclosed in quotes (FM Pro-exported CSV does, but it's a smarter tool in
>general than most of the oddities that come out of Redmond <g>).
>
>My post from 14 June 2002 with my own CSV2Tab function is at
><http://lists.runrev.com/pipermail/metacard/2002-June/001767.html>.
>
>Hats off to anyone who can improve it's speed, and a bottle of
>12-year-old single malt to anyone who can come up with an algorithm I
>can use which is at least twice as fast.

Now there's a challenge I can relate to :-)

BUT - the speed of the conversion depends on the data ...
Enclosed below is a version of the script which is between 10% and 90% 
faster - and probably has potential to go even faster than that.

It uses the same set up as you did - so there are no quotes left except 
those around fields.

Then instead of walking through the data char by char, it use "split()" to 
divide into an array; the array elements must then alternate between 
in-quotes and not-in-quotes.

Each array element has only the relevant processing applied.

Note - the speed of the original is (roughly) based on the number of 
characters, while the speed of the new version is (very roughly) based on 
the number of quoted fields - so for a file of mainly short fields, all of 
which are quoted, it is only 10% or so faster (and there could be cases 
where it would even be slower). For a file with many unquoted fields, or 
where each field is quite large, it will be significantly faster.

function CSV2TabNew pData
   local tNuData -- contains tabbed copy of data
   local tReturnPlaceholder -- replaces cr in field data to avoid line
   --                          breaks which would be misread as records;
   --                          replaced later during dislay
   local tEscapedQuotePlaceholder -- used for keeping track of quotes in data
   local tInQuotedText -- flag set while reading data between quotes
   --
   put numtochar(11) into tReturnPlaceholder -- vertical tab as placeholder
   put numtochar(2)  into tEscapedQuotePlaceholder -- used to simplify
   --                       distinction between quotes in data and those
   --                       used in delimiters
   --
   -- Normalize line endings:
   replace crlf with cr in pData          -- Win to UNIX
   replace numtochar(13) with cr in pData -- Mac to UNIX
   --
   -- Put placeholder in escaped quote (non-delimiter) chars:
   replace ("\"&quote) with tEscapedQuotePlaceholder in pData
   --
   put space before pData   -- to avoid ambiguity of starting context
   split pData by quote
   put False into tInsideQuoted
   repeat for each element k in pData
     if (tInsideQuoted) then
       replace cr with tReturnPlaceholder in k
       put k after tNuData
       put False into tInsideQuoted
     else
       replace comma with tab in k
       put k after tNuData
       put true into tInsideQuoted
     end if
   end repeat
   --
   delete char 1 of tNuData -- remove the leading space
   replace tEscapedQuotePlaceholder with quote in tNuData
   return tNuData
end CSV2TabNew

Note also - this has about the same number of "fragilities" as the original 
(they both fail if the file is mal-formed in about the same number of 
ways).  They also both fail if the original data contained any "escape"s 
(i.e. "\" chars) - they would be doubled in the original data and should be 
checked for before the set-up.


>IMNSHO, CSV2Tab should be a built-in function.  If there's some agreement 
>on this and a willingness to vote for it I'll post the request to Bugzilla.

I'd suggest requesting that it be parameterized to handle the common 
variants of quoting and non-quoting. There's a good discussion of the 
problem (including ways it can go wrong beyond what we've talked about 
here), and a public domain implementation at
http://www.python.org/peps/pep-0305.html#id7

The interface is perhaps wrong for Transcript, but the range of solutions 
it covers would be a good place to start.

-- Alex Tweedly.
-------------- next part --------------

---
Outgoing mail is certified Virus Free.
Checked by AVG anti-virus system (http://www.grisoft.com).
Version: 6.0.707 / Virus Database: 463 - Release Date: 15/06/2004


More information about the use-livecode mailing list