Sample code for reading a CSV file
Richard Gaskin
ambassador at fourthworld.com
Thu Feb 17 19:30:58 EST 2011
Alex Tweedly wrote:
> Many years ago (2004 ?) I posted code to do something like his using
> split/combine to differentiate between 'inside' and 'outside' field
> delimiters. It was very fast - but pretty hard to follow, and I don't
> remember now which obscure cases it handled (we haven't even mentioned
> doubled characters and backslash escaped field delimiters yet :-)
...and then there's in-data returns and other anomalies to account for,
which were quite challenging when this was hashed this out here back in '04.
A popular data set for testing the effectiveness of a CSV parser is this
one, which we find on many pages discussing the evils of CSV:
FirstName,LastName,Address,City,State,Zip
John,Doe,120 jefferson st.,Riverside, NJ, 08075
Jack,McGinnis,220 hobo Av.,Phila, PA,09119
"John ""Da Man""",Repici,120 Jefferson St.,Riverside, NJ,08075
Stephen,Tyler,"7452 Terrace ""At the Plaza"" road",SomeTown,SD, 91234
,Blankman,,SomeTown, SD, 00298
"Joan ""the bone"", Anne",Jet,"9th, at Terrace plc",Desert City,CO,00123
Handy as it is, I like to spice it up by including an in-data return,
which is an acceptable practice in CSV (note "At the Plaza"):
FirstName,LastName,Address,City,State,Zip
John,Doe,120 jefferson st.,Riverside, NJ, 08075
Jack,McGinnis,220 hobo Av.,Phila, PA,09119
"John ""Da Man""",Repici,120 Jefferson St.,Riverside, NJ,08075
Stephen,Tyler,"7452 Terrace ""At
the Plaza"" road",SomeTown,SD, 91234
,Blankman,,SomeTown, SD, 00298
"Joan ""the bone"", Anne",Jet,"9th, at Terrace plc",Desert City,CO,00123
Back in that '04 discussion we kicked around a number of algos here and
from that I came up with one that walked through each character, keeping
track of when it was inside of quoted data and when it wasn't so it
could assemble the resulting tab-delimited data sensibly. It was
accurate, but slow.
Famous for being able to speed up darn near any parsing task with a
split command, you modified the algo to use an array, and even with the
overhead of split it was about five times faster.
I had integrated that idea into the original accurate-but-slow function,
and with your mods it was then accurate-and-fast.
But three months later you revisited the thread to note an anomaly with
Rev's handling of array keys which required one more modification in
order to remain robust across larger data sets.
The final result --- with your note on the array key mod -- is here, and
in my tests it not only outperforms most other alternatives but also
more accurately preserves in-data quotes and in-data returns*:
<http://lists.runrev.com/pipermail/use-livecode/2004-October/045496.html>
At the end of that post you noted:
Obviously it will be slower - but "slow and correct" beats "fast
and wrong" :-)
FWIW, more recent testing shows the difference in applying your last mod
to be about a microsecond using the test data above, not the sort of
speed impairment worth worrying about.
Good moral upbringing and a sense of responsibility to humanity compels
me to note Postel's Law in any discussion of the ridiculously insane
inefficiency inherent in parsing CSV:
"Be liberal in what you accept, and
conservative in what you send"
<http://www.ietf.org/rfc/rfc1122.txt>
While it may be necessary from time to time to be able to import CSV,
the format is long overdue for extinction and should, for the benefit of
a saner and more productive world, never be exported.
A longer rant on this is here for those amused by such things:
<http://lists.runrev.com/pipermail/use-livecode/2010-March/136194.html>
:)
* The resulting tab-delimited format follows the convention used by
FileMaker Pro and others to use ASCII 11 as the replacement for return
within data.
--
Richard Gaskin
Fourth World
LiveCode training and consulting: http://www.fourthworld.com
Webzine for LiveCode developers: http://www.LiveCodeJournal.com
LiveCode Journal blog: http://LiveCodejournal.com/blog.irv
More information about the use-livecode
mailing list