Stupid CSV tricks

Richard Gaskin ambassador at FourthWorld.com
Sat Jun 15 12:06:01 EDT 2002


Yennie at aol.com wrote:

> Richard,
> 
> This one isn't any prettier, but on small data sets it seems to run about 5x
> faster.
> I believe the speedup comes mostly from being able to move one item at a
> time in the repeat loop (rather than one character).
> 
> Basically it does this:
> 1) Stuffs away escaped characters
> 2) Replaces commas with tabs, and erases quotes
> 3) Rebuilds one item at a time, escaping returns only if they are not
> end-of-line
> 4) Brings back escaped characters as unescaped
> 
> Of course the whole thing depends on every line having the same number of
> items- I dunno if that is a given.
> 
> Oh, and it's barely tested =)!
> 
> But I hope it helps.
> 
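> function CSV2Tab2 pData
>   -- control characters stand in for escaped characters and embedded returns
>   put numtochar(6) into tEscapedCommaPlaceholder
>   put numtochar(11) into tReturnPlaceholder
>   put numtochar(2)  into tEscapedQuotePlaceholder
> 
>   -- normalize line endings
>   replace crlf with cr in pData
>   replace numtochar(13) with cr in pData
> 
>   -- 1) stuff away escaped characters
>   replace ("\"&quote) with tEscapedQuotePlaceholder in pData
>   replace ("\"&comma) with tEscapedCommaPlaceholder in pData
> 
>   -- 2) commas become tabs, quotes are erased
>   replace quote with empty in pData
>   replace comma with tab in pData
> 
>   -- 3) rebuild one item at a time
>   set the itemDelimiter to tab
>   put empty into newData
>   put empty into temp
>   put (the number of items in line 1 of pData) into numFields
>   put 1 into i
>   repeat for each item theItem in pData
>     if ((i mod numFields) = 0) then
>       -- this item straddles a line break: it holds the last field of one
>       -- record, the end-of-line return, and the first field of the next,
>       -- so its return is kept and i is advanced twice
>       put theItem&tab after temp
>       put temp after newData
>       add 1 to i
>       put empty into temp
>     else
>       -- a return inside any other item is embedded in a field: escape it
>       replace cr with tReturnPlaceholder in theItem
>       put theItem&tab after temp
>     end if
>     add 1 to i
>   end repeat
> 
>   -- 4) bring back escaped characters as unescaped
>   set the itemDelimiter to comma
>   replace tEscapedCommaPlaceholder with comma in newData
>   replace tEscapedQuotePlaceholder with quote in newData
> 
>   return newData
> end CSV2Tab2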

An excellent speed improvement, but alas with one limitation:  I was
mistaken when I wrote that to distinguish delimiter commas from commas in
data, the latter are escaped with "\".  That applies to quote characters,
but in MS CSV commas may exist within the data, and such commas are not
escaped as quotes are. :(  Accordingly, replacing all commas with tabs could
result in the data for a single field being "tabified" into multiple
fields.
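
For illustration, here is a minimal and barely tested sketch of a pass that
only treats a comma as a delimiter when it falls outside quotes.  The name
CSV2TabQuoted and its variables are made up for this example.  It assumes, as
earlier in this thread, that literal quotes arrive escaped as "\", and that
any field containing commas or returns is wrapped in quotes.  A file that
doubles its quotes instead would need extra handling to preserve them, and
the loop goes back to walking one character at a time, so it gives up the
speed gain above:

```livecode
function CSV2TabQuoted pData
  -- hypothetical helper, not from the original thread
  put numToChar(11) into tReturnPlaceholder
  put numToChar(2) into tEscapedQuotePlaceholder

  -- normalize line endings, then stash escaped quotes (assumed to be \")
  replace crlf with cr in pData
  replace numToChar(13) with cr in pData
  replace ("\" & quote) with tEscapedQuotePlaceholder in pData

  put false into tInQuotes
  put empty into tNewData
  repeat for each char tChar in pData
    if tChar is quote then
      -- an unescaped quote opens or closes a quoted field and is dropped
      put (not tInQuotes) into tInQuotes
    else if (tChar is comma) and (not tInQuotes) then
      -- only a comma outside quotes is a field delimiter
      put tab after tNewData
    else if (tChar is cr) and tInQuotes then
      -- a return inside quotes belongs to the field, not the record
      put tReturnPlaceholder after tNewData
    else
      put tChar after tNewData
    end if
  end repeat

  -- bring back escaped quotes as plain quotes
  replace tEscapedQuotePlaceholder with quote in tNewData
  return tNewData
end CSV2TabQuoted
```

Given a line made of a quoted field containing Smith, John followed by a
comma and 42, this should keep the comma after Smith and put a single tab
before 42.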

CSV is such a silly and inherently inefficient format that it's a wonder it
was ever widely adopted.
 
-- 
 Richard Gaskin 
 Fourth World Media Corporation
 Custom Software and Web Development for All Major Platforms
 Developer of WebMerge 2.0: Publish any Database on Any Site
 ___________________________________________________________
 Ambassador at FourthWorld.com       http://www.FourthWorld.com
 Tel: 323-225-3717                       AIM: FourthWorldInc



