Sample code for reading a CSV file

Richard Gaskin ambassador at fourthworld.com
Thu Feb 17 19:30:58 EST 2011


Alex Tweedly wrote:

> Many years ago (2004 ?) I posted code to do something like his using
> split/combine to differentiate between 'inside' and 'outside' field
> delimiters. It was very fast - but pretty hard to follow, and I don't
> remember now which obscure cases it handled (we haven't even mentioned
> doubled characters and backslash escaped field delimiters yet :-)

...and then there's in-data returns and other anomalies to account for, 
which were quite challenging when this was hashed this out here back in '04.

A popular data set for testing the effectiveness of a CSV parser is this 
one, which we find on many pages discussing the evils of CSV:


FirstName,LastName,Address,City,State,Zip
John,Doe,120 jefferson st.,Riverside, NJ, 08075
Jack,McGinnis,220 hobo Av.,Phila, PA,09119
"John ""Da Man""",Repici,120 Jefferson St.,Riverside, NJ,08075
Stephen,Tyler,"7452 Terrace ""At the Plaza"" road",SomeTown,SD, 91234
,Blankman,,SomeTown, SD, 00298
"Joan ""the bone"", Anne",Jet,"9th, at Terrace plc",Desert City,CO,00123


Handy as it is, I like to spice it up by including an in-data return, 
which is an acceptable practice in CSV (note "At the Plaza"):


FirstName,LastName,Address,City,State,Zip
John,Doe,120 jefferson st.,Riverside, NJ, 08075
Jack,McGinnis,220 hobo Av.,Phila, PA,09119
"John ""Da Man""",Repici,120 Jefferson St.,Riverside, NJ,08075
Stephen,Tyler,"7452 Terrace ""At
  the Plaza"" road",SomeTown,SD, 91234
,Blankman,,SomeTown, SD, 00298
"Joan ""the bone"", Anne",Jet,"9th, at Terrace plc",Desert City,CO,00123


Back in that '04 discussion we kicked around a number of algos here and 
from that I came up with one that walked through each character, keeping 
track of when it was inside of quoted data and when it wasn't so it 
could assemble the resulting tab-delimited data sensibly.  It was 
accurate, but slow.

Famous for being able to speed up darn near any parsing task with a 
split command, you modified the algo to use an array, and even with the 
overhead of split it was about five times faster.

I had integrated that idea into the original accurate-but-slow function, 
and with your mods it was then accurate-and-fast.

But three months later you revisited the thread to note an anomaly with 
Rev's handling of array keys which required one more modification in 
order to remain robust across larger data sets.

The final result --- with your note on the array key mod -- is here, and 
in my tests it not only outperforms most other alternatives but also 
more accurately preserves in-data quotes and in-data returns*:

<http://lists.runrev.com/pipermail/use-livecode/2004-October/045496.html>

At the end of that post you noted:

   Obviously it will be slower - but "slow and correct" beats "fast
   and wrong" :-)

FWIW, more recent testing shows the difference in applying your last mod 
to be about a microsecond using the test data above, not the sort of 
speed impairment worth worrying about.



Good moral upbringing and a sense of responsibility to humanity compels 
me to note Postel's Law in any discussion of the ridiculously insane 
inefficiency inherent in parsing CSV:

   "Be liberal in what you accept, and
    conservative in what you send"
<http://www.ietf.org/rfc/rfc1122.txt>

While it may be necessary from time to time to be able to import CSV, 
the format is long overdue for extinction and should, for the benefit of 
a saner and more productive world, never be exported.

A longer rant on this is here for those amused by such things:
<http://lists.runrev.com/pipermail/use-livecode/2010-March/136194.html>

:)


* The resulting tab-delimited format follows the convention used by 
FileMaker Pro and others to use ASCII 11 as the replacement for return 
within data.

--
  Richard Gaskin
  Fourth World
  LiveCode training and consulting: http://www.fourthworld.com
  Webzine for LiveCode developers: http://www.LiveCodeJournal.com
  LiveCode Journal blog: http://LiveCodejournal.com/blog.irv




More information about the Use-livecode mailing list