Another Revolution Success Story
Richard Gaskin
ambassador at fourthworld.com
Thu Jul 1 18:02:25 EDT 2004
Alex Tweedly wrote:
> At 00:25 01/07/2004 -0700, Richard Gaskin wrote:
>> My post from 14 June 2002 with my own CSV2Tab function is at
>> <http://lists.runrev.com/pipermail/metacard/2002-June/001767.html>.
>>
>> Hats off to anyone who can improve it's speed, and a bottle of
>> 12-year-old single malt to anyone who can come up with an algorithm I
>> can use which is at least twice as fast.
>
> Now there's a challenge I can relate to :-)
>
> BUT - the speed of the conversion depends on the data ...
> Enclosed below is a version of the script which is between 10% and 90%
> faster - and probably has potential to go even faster than that.
>
> It uses the same set up as you did - so there are no quotes left
> except those around fields.
>
> Then instead of walking through the data char by char, it use
> "split()" to divide into an array; the array elements must then
> alternate between in-quotes and not-in-quotes.
>
> Each array element has only the relevant processing applied.
Great stuff. Using split is a ingenious way to reduce the load.
If CSV were consistently implemented CSV2TabNew would work excellently
right out of the box, but since some CSVs escape quotes by doubling them
I needed to add one line (see below) to also substitute doubled quote
chars with the quote placeholder.
Bonus: since the added line reduces the number of quote characters, the
function is now even faster.
I ran 1000 iterations of all three algorithms on a small test file which
uses the Excel escaping format of doubled quotes (I believe it's MS
Access that uses slash-quote to escape, if memory serves). Average
times on my machine (G4 PowerBook) are roughly:
CSV2Tab: 438ms
CSV2TabNew: 369ms
CSV2Tab3: 195ms
On a slightly more complex example the times were:
CSV2Tab: 892ms
CSV2TabNew: 333ms
CSV2Tab3: 153ms
Here's CSV2Tab3:
function CSV2Tab3 pData
local tNuData -- contains tabbed copy of data
local tReturnPlaceholder -- replaces cr in field data to avoid line
-- breaks which would be misread as records;
-- replaced later during dislay
local tEscapedQuotePlaceholder -- used for keeping track of quotes
-- in data
local tInQuotedText -- flag set while reading data between quotes
--
put numtochar(11) into tReturnPlaceholder -- vertical tab as
-- placeholder
put numtochar(2) into tEscapedQuotePlaceholder -- used to simplify
-- distinction between quotes in data and those
-- used in delimiters
--
-- Normalize line endings:
replace crlf with cr in pData -- Win to UNIX
replace numtochar(13) with cr in pData -- Mac to UNIX
--
-- Put placeholder in escaped quote (non-delimiter) chars:
replace ("\""e) with tEscapedQuotePlaceholder in pData
replace quote"e with tEscapedQuotePlaceholder in pData --<NEW
--
put space before pData -- to avoid ambiguity of starting context
split pData by quote
put False into tInsideQuoted
repeat for each element k in pData
if (tInsideQuoted) then
replace cr with tReturnPlaceholder in k
put k after tNuData
put False into tInsideQuoted
else
replace comma with tab in k
put k after tNuData
put true into tInsideQuoted
end if
end repeat
--
delete char 1 of tNuData -- remove the leading space
replace tEscapedQuotePlaceholder with quote in tNuData
return tNuData
end CSV2Tab3
Please send me a private email and we'll make arrangements for the
scotch delivery. Thanks for the assist.
--
Richard Gaskin
Fourth World Media Corporation
___________________________________________________
Rev tools and more: http://www.fourthworld.com/rev
More information about the use-livecode
mailing list