Parse a CSV File with Regular Expressions
Alex Tweedly
alex at tweedly.net
Tue Jan 18 06:21:15 EST 2005
Thomas Gutzmann wrote:
Note: I've re-ordered your email so that I can reply to its sections in a
different order.
>
> I don't know if it works in Rev because every implementation of RE is
> a bit different, and Perl has the best I've come across. Anyway: Perl
> can be installed on every machine, it's pre-installed on Unix, Linux
> and MacOS/X, so just use the power of this language in combination
> with Rev, RB or whatever development tool you use, instead of trying
> to do everything with one tool.
In general, I fully agree with that - use the right tool for the right
job. But if you have a large Rev application, it is neither simple nor
clean to fire up a Perl script to do one small part such as importing a
file from another app; far simpler and better to do it within the Rev
app if feasible. If I had a large, mainly regex app to do, I'd consider
Perl - but within the context of this being 1% of an otherwise Rev app,
it's worth finding a Transcript or Rev-regex solution.
> I'm missing this flexibility in the usage of tools in the IT world.
> Nobody in the industry would use a Porsche to transport stones (except
> the ones weared around the neck or wherever ladies have them), and
> nobody would drive a fork-lift truck on a (German) Autobahn.
That reminds me of the Silicon Valley saying:
"You don't need a Porsche to commute 10 miles to work - but you'd
never know that from looking at Highway 101".
> Most of us use hands and feet for their respective purposes. So why do
> programmers want to use one tool for all?
Because there's a level of inefficiency and discomfort caused by
frequent changes in language and tools. Because it's hard to become an
expert in one language - doing it in Rev and Perl and PHP and Python and
Java and .... is probably impossible. Because it's easy, but wrong, to
write one language using the style and tricks of another (see various
blog threads about "Python's not Java", etc.)
But mostly just because programmers are people :-)
> On Tue, 18 Jan 2005 01:36:48 +0000
> Alex Tweedly <alex at tweedly.net> wrote:
>
>> Rev's RE library is based on PCRE, so should be adequately capable.
>>
>> However, I don't think it's as easy to parse the realistic version of
>> CSV with REs as you might think.
>
>
> Well, Alex, it's not so difficult with Perl.
I have to admit I was under the misapprehension that PCRE meant that it
was very close to full Perl; I'm not so sure about that now.
> If the items in the comma-separated list can contain other commata, in
> which case they are enclosed by quotes (optionally otherwise), like
> '"a,b",c,"d"', then the Perl script to parse the list looks like:
>
> #!/usr/bin/perl
> @s = (
>     '"My family, My PowerBook, My Defender 110","1","mylife at home.com"',
>     'Scrooge,2,billionaire at minimum.com',
>     'RunRev List,"3,4,...","all at the-rest.co.uk"');
> foreach (@s) {
>     if (/"*([^"]+)"*,"*([^"]+)"*,"*([^"]+)"*/) {
>         print ("$_\n\t$1\n\t$2\n\t$3\n");
>     }
> }
>
Yeah - that's a good start. In the "scoring system" I invented last
night while looking at various "csv" scripts, that's probably a 60% or
70% solution; it's the remaining 30% that is hard.
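To make the "60% or 70%" concrete, here is a minimal sketch that ports the quoted pattern to Python's `re` module. Python is only my choice for illustration (the thread itself uses Perl and Rev), and the helper name `split3` is hypothetical. It handles the three sample lines, but on a field containing doubled quotes it still matches and quietly returns the wrong pieces.

```python
import re

# Port of the quoted Perl pattern; assumed to behave the same under Python's re.
FIELD3 = re.compile(r'"*([^"]+)"*,"*([^"]+)"*,"*([^"]+)"*')

def split3(line):
    """Return the three captured fields, or None if the pattern does not match."""
    m = FIELD3.search(line)  # unanchored, like the Perl match operator
    return m.groups() if m else None

samples = [
    '"My family, My PowerBook, My Defender 110","1","mylife at home.com"',
    'Scrooge,2,billionaire at minimum.com',
    'RunRev List,"3,4,...","all at the-rest.co.uk"',
    # Doubled quotes inside the first field: the pattern cannot tell an
    # embedded "" from a field boundary, so it splits in the wrong places.
    '"""My family"",""My PowerBook"",""My Defender 110""","1","mylife at home.com"',
]

for s in samples:
    print(s)
    print("    ", split3(s))
```

The pattern also hard-codes exactly three fields and rejects empty ones, which is part of why I'd call it a starting point rather than a parser.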
This is NOT a challenge! If you want to go further because you're
interested - please do. But don't feel that I'm "challenging" you to do
so. I have a solution (scripted) that is perfectly adequate in coverage
(maybe 90% or 95% - certainly not 100%), and more than adequate in speed.
The remaining cases include (but are not limited to):
- embedded CRs (or newlines, or other line breaks) inside a quoted field
- embedded quotes, which can be either escaped (preceded by '\') or,
  more often, doubled ("a field named ""alex"" is here"). Each file
  should use one convention or the other; I've never seen both in the
  same file, though it wouldn't surprise me if some MS product did that.
- including (or excluding) non-embedded spaces before or after a
  quoted field
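For comparison only, here is how those cases map onto the options of Python's standard `csv` module. This is a sketch in a different tool from anything used in this thread, and `read_rows` is a hypothetical helper name. The point is that each case corresponds to a separate switch, which is why a single fixed regex struggles with them.

```python
import csv
import io

def read_rows(text, **fmt):
    """Parse CSV text with csv.reader; fmt carries the dialect options."""
    # newline='' keeps line breaks untouched so the reader sees them as written.
    return list(csv.reader(io.StringIO(text, newline=''), **fmt))

# Embedded line break inside a quoted field (handled by default).
print(read_rows('"line one\nline two",x\r\n'))

# Doubled quotes (the default, doublequote=True).
print(read_rows('"a field named ""alex"" is here",2'))

# Backslash-escaped quotes need to be requested explicitly.
print(read_rows('"a field named \\"alex\\" is here",2',
                doublequote=False, escapechar='\\'))

# Spaces after the delimiter, before the opening quote.
print(read_rows('a, "b,c", d', skipinitialspace=True))
```

Note that `skipinitialspace` only covers spaces after a delimiter; spaces after a closing quote are a separate problem that this sketch does not address.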
The best example of the embedded quote case is:
"""My family"",""My PowerBook"",""My Defender 110""","1","mylife at home.com"
which should (obviously) give
"My family","My PowerBook","My Defender 110"
1
mylife at home.com
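One way to get that result without a regex is a character-by-character scan that tracks whether it is inside quotes. The following is a minimal sketch in Python, not the scripted Rev solution mentioned above; the function name `parse_csv` and its parameters are hypothetical. It assumes the doubled-quote convention, and it neither handles backslash escapes nor trims spaces around quoted fields.

```python
def parse_csv(text, delim=',', quote='"'):
    """Split CSV text into rows of fields, honouring quotes and doubled quotes."""
    rows, row, field = [], [], []
    in_quotes = False
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if in_quotes:
            if ch == quote:
                if i + 1 < n and text[i + 1] == quote:
                    field.append(quote)   # doubled quote means a literal quote
                    i += 1
                else:
                    in_quotes = False     # closing quote
            else:
                field.append(ch)          # commas and line breaks are kept as data
        elif ch == quote:
            in_quotes = True
        elif ch == delim:
            row.append(''.join(field))
            field = []
        elif ch in '\r\n':
            if ch == '\r' and i + 1 < n and text[i + 1] == '\n':
                i += 1                    # treat CRLF as a single line break
            row.append(''.join(field))
            field = []
            rows.append(row)              # a blank line yields one empty field
            row = []
        else:
            field.append(ch)
        i += 1
    if field or row:                      # last record without a trailing newline
        row.append(''.join(field))
        rows.append(row)
    return rows

line = '"""My family"",""My PowerBook"",""My Defender 110""","1","mylife at home.com"'
for f in parse_csv(line)[0]:
    print(f)
```

Because the scan never backtracks, its running time grows linearly with the input, and adding a backslash-escape mode would be one more branch inside the quoted state.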
-- Alex.