Comma-delimited values

Richard Gaskin ambassador at fourthworld.com
Mon Mar 8 13:55:16 EST 2010


Paul D. DeRocco wrote:
> Add an inQuotes flag, with an initial value of false. For each item, if it
> has a quote in it, toggle the inQuotes flag. Then, if inQuotes is set,
> append a comma instead of a tab, to put the item back together again.

Roger that.  CSV elements may contain returns within quoted portions, 
and escaping uses a wide range of conventions differing from program to 
program (within the Microsoft family I've seen it differ from version to 
version of the same program).  The flag method, while notoriously slow, 
is the only reliable method I've found for handling all the many 
variants of CSV.
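In LiveCode terms the shape of it is roughly this untested sketch (handler and variable names are just my own labels), which converts a single CSV line to tab-delimited. It toggles the flag per quote character rather than per item, so fully quoted fields like "Paris" come through as well; it makes no attempt to unescape "" or \", and quoted returns spanning lines would need the same state carried across lines:

function csvLineToTab pLine
   local tInQuotes, tFirst, tOutput
   put false into tInQuotes
   put true into tFirst
   set the itemDelimiter to comma
   repeat for each item tItem in pLine
      -- pick the delimiter to emit before this item based on the state
      -- left by the previous one
      if tFirst then
         put false into tFirst
      else if tInQuotes then
         put comma after tOutput  -- this comma was inside quotes; restore it
      else
         put tab after tOutput    -- a real field boundary
      end if
      put tItem after tOutput
      -- each quote character flips the state
      repeat for each char tChar in tItem
         if tChar is quote then put not tInQuotes into tInQuotes
      end repeat
   end repeat
   return tOutput
end csvLineToTab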

<rant>
A plea for sanity in the software development world:

While we need to write CSV importers from time to time, please never 
write CSV exporters.

CSV must die.

The problem with CSV is that the comma is very commonly used in data, 
making it a uniquely stupid choice as a delimiter.

That stupidity could be countered with consistent escaping, but no true 
standard has emerged in the many decades since this productivity abuse 
began.

Making a stupid decision even more stupid, most CSV files quote 
non-numeric values, as though the programmers did not realize that the 
quote character is commonly used in our language and is therefore likely 
to turn up in the data. So using quote as an escape means you must 
escape the escape.  Jeeminy, who thinks up that merde?!?!

Sometimes the escape for an in-data double-quote is a doubled 
double-quote, which can make it hard to know what to do with empty 
values shown as "", especially given that in some formats the empty 
value abuts the adjacent commas while in others, like MySQL dumps, it 
abuts only the trailing comma and has a space before the leading comma. 
Other times double-quotes are escaped with \", meaning you'll need to 
escape any in-data backslashes too.
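To make the mess concrete, here is one record containing an in-data 
quote as two common flavors might write it (the values are made up 
purely for illustration):

   doubled quotes:   "She said ""hi""",42
   backslash style:  "She said \"hi\"",42

A parser written for one flavor quietly mangles the other.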

For most thinking people, about the time you realize you're escaping 
the escape you've escaped to handle your data, it would occur to you to 
go back and check your original premise.  But not so for the creators of 
CSV.

As Jim Bufalini pointed out, tab-delimited or even (in fewer cases) 
pipe-delimited formats are much saner options.

For my own delimited exports I've adopted the convention used in 
FileMaker Pro and others, with escapes that are unlikely to be in the 
data:

- records are delimited with returns
- fields are delimited with tabs
- quotes are never added and are included only when they are part
   of the data
- any in-data returns are escaped with ASCII 11
- any in-data tabs are escaped with ASCII 4

Simple to write, lightning-fast to parse.
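In LiveCode terms the escaping is about as simple as it gets.  A quick 
sketch, with handler names that are just my own labels:

function escapeField pField
   -- swap in-data returns and tabs for rarely-seen control characters
   -- so return and tab can serve purely as delimiters
   replace return with numToChar(11) in pField
   replace tab with numToChar(4) in pField
   return pField
end escapeField

function unescapeField pField
   -- reverse the substitution on import
   replace numToChar(11) with return in pField
   replace numToChar(4) with tab in pField
   return pField
end unescapeField

Fields are joined with tab and records with return, and the reader just 
splits on those with no look-ahead at all, which is where the speed 
comes from.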

When you add up all the programmer- and end-user-hours lost to dealing 
with the uniquely stupid collection of mish-mashed ad-hoc formats that 
is CSV, it amounts to nothing less than a crime against humanity. 
Hundreds if not thousands of human lifetimes have been lost either to 
dealing with bugs in CSV parsers, or simply to waiting for the 
inherently slow parsing of CSV that could have taken mere milliseconds 
in any saner format.

CSV must die.

Please help it die:  never write CSV exporters.
</rant>

--
  Richard Gaskin
  Fourth World
  Rev training and consulting: http://www.fourthworld.com
  Webzine for Rev developers: http://www.revjournal.com
  revJournal blog: http://revjournal.com/blog.irv


