Comma-delimited values
Richard Gaskin
ambassador at fourthworld.com
Mon Mar 8 13:55:16 EST 2010
Paul D. DeRocco wrote:
> Add an inQuotes flag, with an initial value of false. For each item, if it
> has a quote in it, toggle the inQuotes flag. Then, if inQuotes is set,
> append a comma instead of a tab, to put the item back together again.
Roger that. CSV elements may contain returns within quoted portions,
and escaping uses a wide range of conventions differing from program to
program (within the Microsoft family I've seen it differ from version to
version of the same program). The flag method, while notoriously slow,
is the only reliable method I've found for handling the many variants
of CSV.
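That flag method can be sketched roughly as follows (Python used here for brevity; the function name is mine, and this handles a single record, assuming double-quote quoting with embedded quotes doubled):

```python
def split_csv_record(line):
    """Split one CSV record on commas, then stitch back together any
    fields that were split mid-quote, using an inQuotes flag.
    A sketch of the flag method described above."""
    fields = []
    in_quotes = False
    for piece in line.split(","):
        if in_quotes:
            # Still inside a quoted field: restore the comma we split on.
            fields[-1] += "," + piece
        else:
            fields.append(piece)
        # An odd number of quote marks in this piece flips the state;
        # doubled quotes ("") come in pairs and leave it unchanged.
        if piece.count('"') % 2 == 1:
            in_quotes = not in_quotes
    return fields

# e.g. split_csv_record('1,"Smith, John",NY')
#      -> ['1', '"Smith, John"', 'NY']
```

The same toggle extends to records spanning multiple lines: if the flag is still set at the end of a line, the next line belongs to the same record.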
<rant>
A plea for sanity in the software development world:
While we need to write CSV importers from time to time, please never
write CSV exporters.
CSV must die.
The problem with CSV is that the comma is very commonly used in data,
making it a uniquely stupid choice as a delimiter.
That stupidity could be countered with consistent escaping, but no true
standard has emerged in the many decades since this productivity abuse
began.
Making a stupid decision even more stupid, most CSV files quote
non-numeric values, as though the programmers did not realize that the
quote character is commonly used in our language and is therefore
likely to be part of the data. So using quote as an escape means you
must escape the escape. Jeeminy, who thinks up that merde?!
Sometimes the escape for an in-data double-quote is a doubled
double-quote, which can make it hard to know what to do with empty
values shown as "", especially given that in some formats the empty
value abuts the adjacent commas, while in others, like MySQL dumps, it
abuts only the trailing comma and has a space before the leading comma.
Other times double-quotes are escaped with \", meaning you must also
escape any in-data backslashes.
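Those two escaping conventions can be illustrated with a small unescaping helper (a sketch; the function name and the backslash_style switch are my own, and real-world files mix conventions unpredictably):

```python
def unquote_field(field, backslash_style=False):
    """Strip surrounding quotes from a CSV field and unescape embedded
    quotes, under one of the two conventions described above:
    doubled double-quotes (""), or, when backslash_style is True,
    backslash escapes (\" and \\)."""
    if len(field) >= 2 and field.startswith('"') and field.endswith('"'):
        body = field[1:-1]
        if backslash_style:
            # Protect escaped backslashes before unescaping quotes.
            body = body.replace('\\\\', '\x00').replace('\\"', '"')
            body = body.replace('\x00', '\\')
        else:
            body = body.replace('""', '"')
        return body
    return field

# e.g. unquote_field('"He said ""hi"""') -> 'He said "hi"'
```

Note that the caller has to know, or guess, which convention a given file uses; nothing in the format itself tells you.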
For thinking people, about the time you realize you're escaping the
escape that you've escaped to handle your data, it would occur to you
to go back and question your original premise. Not so for the creators
of CSV.
As Jim Bufalini pointed out, tab-delimited or even (in fewer cases)
pipe-delimited are much saner options.
For my own delimited exports I've adopted the convention used in
FileMaker Pro and others, with escapes that are unlikely to be in the
data:
- records are delimited with returns
- fields are delimited with tabs
- quotes are never added and appear only when they are part of the data
- any in-data returns are escaped with ASCII 11
- any in-data tabs are escaped with ASCII 4
Simple to write, lightning-fast to parse.
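Under those conventions, export and import reduce to a few string replacements and splits. A rough sketch in Python (the helper names and the use of "\n" as the record delimiter are my assumptions; Rev's native line and item chunk handling would do the same job natively):

```python
RETURN_ESC = "\x0b"  # ASCII 11 stands in for in-data returns
TAB_ESC = "\x04"     # ASCII 4 stands in for in-data tabs

def export_records(records):
    """records: a list of records, each a list of field strings.
    Returns one export string: tab-delimited fields, return-delimited
    records, with in-data returns and tabs escaped as above."""
    lines = []
    for record in records:
        fields = [f.replace("\n", RETURN_ESC).replace("\t", TAB_ESC)
                  for f in record]
        lines.append("\t".join(fields))
    return "\n".join(lines)

def import_records(text):
    """Inverse of export_records: parsing is just a pair of splits."""
    return [[f.replace(RETURN_ESC, "\n").replace(TAB_ESC, "\t")
             for f in line.split("\t")]
            for line in text.split("\n")]
```

No quoting, no state machine, no escaping the escape: a round trip is lossless as long as the data itself never contains ASCII 4 or 11, which is the point of choosing them.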
When you add up all the programmer- and end-user-hours lost to dealing
with the uniquely stupid collection of mish-mashed ad-hoc formats that
is CSV, it amounts to nothing less than a crime against humanity.
Hundreds if not thousands of human lifetimes have been lost either
dealing with bugs related to CSV parsers, or simply waiting for the
inherently slow parsing of CSV that could have taken mere milliseconds
in any saner format.
CSV must die.
Please help it die: never write CSV exporters.
</rant>
--
Richard Gaskin
Fourth World
Rev training and consulting: http://www.fourthworld.com
Webzine for Rev developers: http://www.revjournal.com
revJournal blog: http://revjournal.com/blog.irv
More information about the use-livecode
mailing list