CSV again.

Alex Tweedly alex at tweedly.net
Sun Oct 18 20:01:50 EDT 2015



On 18/10/2015 03:17, Peter M. Brigham wrote:
> At this point, finding a function that does the task at all -- reliably and taking into account most of the csv malformations we can anticipate -- would be a start. So far nothing has been unbreakable. Once we find an algorithm that does the job, we can focus on speeding it up.
>
That is indeed the issue.

There are two distinct problems, and the "best" solutions for each may 
be different.

1. Optimistic parser.

Properly parse any well-formed CSV data, in any idiosyncratic dialect of 
CSV that we may be interested in.

Or to put it another way: in general we are going to be parsing data 
produced by some program. It may take an oddball approach to CSV 
formatting, but it will be "correct" in the program's own terms. We are 
not (in this problem) trying to handle, e.g., hand-generated files that 
may contain errors, or have deliberate errors embedded. Thus, we do not 
expect things like mismatched quotes, and it will be adequate to do 
"something reasonable" given bad input data.
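To make the distinction concrete, here is a minimal sketch of the "optimistic" approach, in Python rather than LiveCode (my own illustration, not code from this thread): trust the producer's dialect, hand the work to a lenient parser, and accept a best-effort result on odd input rather than rejecting it.

```python
import csv
import io

def optimistic_parse(text, delimiter=","):
    """Parse CSV text, trusting that the producer was self-consistent.

    csv.reader tolerates idiosyncrasies such as embedded newlines and
    delimiters inside quoted cells, and does "something reasonable"
    with input it cannot fully make sense of.
    """
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))

rows = optimistic_parse('a,"b,1"\n"line1\nline2",c\n')
# rows == [['a', 'b,1'], ['line1\nline2', 'c']]
```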

2. Pessimistic parser.

Just the opposite - try to detect any arbitrary malformation with a 
sensible error message, and properly parse any well-formed CSV data in 
any dialect we might encounter.
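By contrast, a "pessimistic" parser has to track structure explicitly so it can name the malformation it found. The sketch below (again my own Python illustration, simplified to one class of error - an unterminated quote) shows the shape of that extra bookkeeping:

```python
def validate_csv(text, quote='"'):
    """Reject CSV text with an unterminated quoted cell.

    A simplified check: every unescaped quote is treated as structural,
    and a doubled quote ("") inside a quoted cell is treated as an
    escaped quote character.
    """
    in_quotes = False
    i, line = 0, 1
    while i < len(text):
        ch = text[i]
        if ch == quote:
            if in_quotes and i + 1 < len(text) and text[i + 1] == quote:
                i += 1  # skip the second half of an escaped quote ("")
            else:
                in_quotes = not in_quotes
        elif ch == "\n" and not in_quotes:
            line += 1  # newlines inside quoted cells are cell content
        i += 1
    if in_quotes:
        raise ValueError(
            "unterminated quote opened on or before line %d" % line)

validate_csv('a,"b""c",d\n')   # well-formed: passes silently
# validate_csv('a,"b,c\n')     # raises ValueError: unterminated quote...
```

A full pessimistic parser would track more states (quote opening mid-field, stray characters after a closing quote, and so on), which is exactly why it is the harder problem.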

And common to both:
- adequate (optional) control over delimiters, escaped characters in the 
output, etc.
- efficiency (speed) matters

IMHO, we should also specify that the output should
  - remove the enclosing quotes from quoted cells
  - reduce doubled-quotes within a quoted cell to the appropriate single 
instance of a quote
in order that the TSV (or array, or whatever output format is chosen) 
does not need further processing to remove them; i.e. the output data is 
clean of any CSV formatting artifacts.
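Those two clean-up rules are cheap to apply per cell. A small Python illustration (my own sketch of the rules above, not code from this thread):

```python
def clean_cell(cell, quote='"'):
    """Strip CSV quoting artifacts from one already-split cell."""
    if len(cell) >= 2 and cell.startswith(quote) and cell.endswith(quote):
        # Remove the enclosing quotes, then reduce each doubled quote
        # ("") inside the cell to a single quote character.
        return cell[1:-1].replace(quote * 2, quote)
    return cell

clean_cell('"He said ""hi"""')   # -> 'He said "hi"'
clean_cell('plain')              # unquoted cells pass through unchanged
```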

Personally, I am a pragmatist, and I have always needed solution 1 
above. Whenever I've had to parse CSV data, it's because I had a 
real-world need to do so, and the data was coming from some well-behaved 
(even if very weird) application - so it was consistent and followed 
some kind of rules, however wacky those rules might be. Other people may 
have different needs.

So I believe that any proposed algorithm should be clear about which of 
these two distinct problems it is trying to solve, and should be judged 
accordingly. Then each of us can look for the most efficient solution to 
whichever problem we most care about.

I do believe that any solution to problem 2 is also a solution to 
problem 1 - but I don't know if it can be as efficient while tackling 
that harder problem.

-- Alex.






More information about the use-livecode mailing list