CSV again.
Alex Tweedly
alex at tweedly.net
Sun Oct 18 20:01:50 EDT 2015
On 18/10/2015 03:17, Peter M. Brigham wrote:
> At this point, finding a function that does the task at all -- reliably and taking into account most of the csv malformations we can anticipate -- would be a start. So far nothing has been unbreakable. Once we find an algorithm that does the job, we can focus on speeding it up.
>
That is indeed the issue.
There are two distinct problems, and the "best" solutions for each may
be different.
1. Optimistic parser.
Properly parse any well-formed CSV data, in any idiosyncratic dialect of
CSV that we may be interested in.
Or to put it another way: in general we are going to be parsing data
produced by some program - it may take some oddball approach to CSV
formatting, but it will be "correct" in the program's own terms. We are
not (in this problem) trying to handle, e.g., hand-generated files that
may contain errors, or have deliberate errors embedded. Thus, we do not
expect things like mismatched quotes, etc. - and it will be adequate to
do "something reasonable" given bad input data. (A sketch of such a
parser follows the requirements below.)
2. Pessimistic parser.
Just the opposite - try to detect any arbitrary malformation with a
sensible error message, and properly parse any well-formed CSV data in
any dialect we might encounter.
And common to both:
- adequate (optional) control over delimiters, escaped characters in the
output, etc.
- efficiency (speed) matters
IMHO, we should also specify that the output should:
- remove the enclosing quotes from quoted cells
- reduce doubled quotes within a quoted cell to the appropriate single
instance of a quote
in order that the TSV (or array, or whatever output format is chosen)
does not need further processing to remove them; i.e. the output data is
clean of any CSV formatting artifacts.
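To make problem 1 and the output clean-up concrete, here is a minimal
sketch of the kind of "optimistic" converter described above. It is my
own illustration, not something proposed in the thread: the function and
variable names are invented, it assumes comma delimiters and line endings
already normalised to return, and it does no error checking at all - it
simply strips enclosing quotes, reduces doubled quotes, and emits
tab-and-return delimited text.

function csvToTab pData
   local tOutput, tCell, tChar
   local tInQuotes, tPendingQuote
   put false into tInQuotes
   put false into tPendingQuote
   repeat for each char tChar in pData
      if tPendingQuote then
         -- the previous char was a quote inside a quoted cell: a second
         -- quote means a literal quote character, anything else ends the
         -- quoted section of the cell
         put false into tPendingQuote
         if tChar is quote then
            put quote after tCell
            next repeat
         end if
         put false into tInQuotes
      end if
      if tInQuotes then
         if tChar is quote then
            put true into tPendingQuote
         else
            put tChar after tCell   -- commas and returns are literal here
         end if
      else
         switch tChar
            case quote
               put true into tInQuotes   -- opening quote of a quoted cell
               break
            case comma
               put tCell & tab after tOutput   -- end of cell
               put empty into tCell
               break
            case return
               put tCell & return after tOutput   -- end of record
               put empty into tCell
               break
            default
               put tChar after tCell
         end switch
      end if
   end repeat
   put tCell after tOutput   -- final cell has no trailing delimiter
   return tOutput
end csvToTab

Note that a cell which genuinely contains a tab or a return cannot be
represented unambiguously in plain TSV, which is one argument for the
array output mentioned above.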
Personally, I am a pragmatist, and I have always needed solution 1 above
- whenever I've had to parse CSV data, it's because I had a real-world
need to do so, and the data was coming from some well-behaved (even if
very weird) application - so it was consistent and followed some kind of
rules, however wacky those rules might be. Other people may have
different needs.
So I believe that any proposed algorithm should be clear about which of
these two distinct problems it is trying to solve, and should be judged
accordingly. Then each of us can look for the most efficient solution to
whichever one we most care about.
I do believe that any solution to problem 2 is also a solution to
problem 1 - but I don't know if it can be as efficient while tackling
that harder problem.
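To illustrate that (again my own sketch, not something from the thread),
a problem-2 version of the converter above is the same state machine
with a handful of extra checks; each check is one more comparison per
character, which is exactly where the efficiency question arises. The
particular error conditions below are just examples.

function csvToTabStrict pData
   local tOutput, tCell, tChar
   local tInQuotes, tPendingQuote, tCellStarted
   put false into tInQuotes
   put false into tPendingQuote
   put false into tCellStarted
   repeat for each char tChar in pData
      if tPendingQuote then
         put false into tPendingQuote
         if tChar is quote then
            put quote after tCell
            next repeat
         end if
         put false into tInQuotes
         -- a closing quote must be followed by a delimiter or end of record
         if tChar is not comma and tChar is not return then
            throw "malformed CSV: text after closing quote"
         end if
      end if
      if tInQuotes then
         if tChar is quote then
            put true into tPendingQuote
         else
            put tChar after tCell
         end if
      else
         switch tChar
            case quote
               -- a quote is only legal at the very start of a cell
               if tCellStarted then
                  throw "malformed CSV: quote inside unquoted cell"
               end if
               put true into tInQuotes
               put true into tCellStarted
               break
            case comma
               put tCell & tab after tOutput
               put empty into tCell
               put false into tCellStarted
               break
            case return
               put tCell & return after tOutput
               put empty into tCell
               put false into tCellStarted
               break
            default
               put tChar after tCell
               put true into tCellStarted
         end switch
      end if
   end repeat
   if tInQuotes and not tPendingQuote then
      throw "malformed CSV: unterminated quoted cell"
   end if
   put tCell after tOutput
   return tOutput
end csvToTabStrict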
-- Alex.