Another Revolution Success Story
Alex Tweedly
alex at tweedly.net
Thu Jul 1 11:06:34 EDT 2004
At 00:25 01/07/2004 -0700, Richard Gaskin wrote:
>I asked around on this some time ago, including quite a few programmers
>far smarter than me. The best algorithm we could come up with was one
>which walks through the data char by char, keeping track of when it's in
>field data and when it leaves the field, noting that commas are escaped
>inconsistently in MS products and not all fields have their data
>enclosed in quotes (FM Pro-exported CSV does, but it's a smarter tool in
>general than most of the oddities that come out of Redmond <g>).
>
>My post from 14 June 2002 with my own CSV2Tab function is at
><http://lists.runrev.com/pipermail/metacard/2002-June/001767.html>.
>
>Hats off to anyone who can improve it's speed, and a bottle of
>12-year-old single malt to anyone who can come up with an algorithm I
>can use which is at least twice as fast.
Now there's a challenge I can relate to :-)
BUT - the speed of the conversion depends on the data ...
Enclosed below is a version of the script which is between 10% and 90%
faster - and probably has potential to go even faster than that.
It uses the same set up as you did - so there are no quotes left except
those around fields.
Then instead of walking through the data char by char, it use "split()" to
divide into an array; the array elements must then alternate between
in-quotes and not-in-quotes.
Each array element has only the relevant processing applied.
Note - the speed of the original is (roughly) based on the number of
characters, while the speed of the new version is (very roughly) based on
the number of quoted fields - so for a file of mainly short fields, all of
which are quoted, it is only 10% or so faster (and there could be cases
where it would even be slower). For a file with many unquoted fields, or
where each field is quite large, it will be significantly faster.
function CSV2TabNew pData
local tNuData -- contains tabbed copy of data
local tReturnPlaceholder -- replaces cr in field data to avoid line
-- breaks which would be misread as records;
-- replaced later during dislay
local tEscapedQuotePlaceholder -- used for keeping track of quotes in data
local tInQuotedText -- flag set while reading data between quotes
--
put numtochar(11) into tReturnPlaceholder -- vertical tab as placeholder
put numtochar(2) into tEscapedQuotePlaceholder -- used to simplify
-- distinction between quotes in data and those
-- used in delimiters
--
-- Normalize line endings:
replace crlf with cr in pData -- Win to UNIX
replace numtochar(13) with cr in pData -- Mac to UNIX
--
-- Put placeholder in escaped quote (non-delimiter) chars:
replace ("\""e) with tEscapedQuotePlaceholder in pData
--
put space before pData -- to avoid ambiguity of starting context
split pData by quote
put False into tInsideQuoted
repeat for each element k in pData
if (tInsideQuoted) then
replace cr with tReturnPlaceholder in k
put k after tNuData
put False into tInsideQuoted
else
replace comma with tab in k
put k after tNuData
put true into tInsideQuoted
end if
end repeat
--
delete char 1 of tNuData -- remove the leading space
replace tEscapedQuotePlaceholder with quote in tNuData
return tNuData
end CSV2TabNew
Note also - this has about the same number of "fragilities" as the original
(they both fail if the file is mal-formed in about the same number of
ways). They also both fail if the original data contained any "escape"s
(i.e. "\" chars) - they would be doubled in the original data and should be
checked for before the set-up.
>IMNSHO, CSV2Tab should be a built-in function. If there's some agreement
>on this and a willingness to vote for it I'll post the request to Bugzilla.
I'd suggest requesting that it be parameterized to handle the common
variants of quoting and non-quoting. There's a good discussion of the
problem (including ways it can go wrong beyond what we've talked about
here), and a public domain implementation at
http://www.python.org/peps/pep-0305.html#id7
The interface is perhaps wrong for Transcript, but the range of solutions
it covers would be a good place to start.
-- Alex Tweedly.
-------------- next part --------------
---
Outgoing mail is certified Virus Free.
Checked by AVG anti-virus system (http://www.grisoft.com).
Version: 6.0.707 / Virus Database: 463 - Release Date: 15/06/2004
More information about the use-livecode
mailing list