Parse a CSV File with Regular Expressions

Alex Tweedly alex at tweedly.net
Tue Jan 18 06:21:15 EST 2005


Thomas Gutzmann wrote:
(Note: I've re-ordered your email so that I can reply to its sections 
in a different order.)

>
> I don't know if it works in Rev because every implementation of RE is 
> a bit different, and Perl has the best I've come across. Anyway: Perl 
> can be installed on every machine, it's pre-installed on Unix, Linux 
> and MacOS/X, so just use the power of this language in combination 
> with Rev, RB or whatever development tool you use, instead of trying 
> to do everything with one tool.

In general, I fully agree with that - use the right tool for the right 
job. But if you have a large Rev application, it is neither simple nor 
clean to fire up a Perl script to do one small part such as importing a 
file from another app; far simpler and better to do it within the Rev 
app if feasible. If I had a large, mainly regex app to do, I'd consider 
Perl - but within the context of this being 1% of an otherwise Rev app, 
it's worth finding a Transcript or Rev-regex solution.

> I'm missing this flexibility in the usage of tools in the IT world. 
> Nobody in the industry would use a Porsche to transport stones (except 
> the ones worn around the neck or wherever ladies have them), and 
> nobody would drive a fork-lift truck on a (German) Autobahn.

That reminds me of the Silicon Valley saying:
   "You don't need a Porsche to commute 10 miles to work - but you'd 
never know that from looking at Highway 101".

> Most of us use hands and feet for their respective purposes. So why do 
> programmers want to use one tool for all? 

Because there's a level of inefficiency and discomfort caused by 
frequent changes in language and tools. Because it's hard to become an 
expert in one language - doing it in Rev and Perl and PHP and Python and 
Java and .... is probably impossible. Because it's easy, but wrong, to 
write one language using the style and tricks of another (see various 
blog threads about "Python's not Java", etc.)

But mostly just because programmers are people :-)


> On Tue, 18 Jan 2005 01:36:48 +0000
>  Alex Tweedly <alex at tweedly.net> wrote:
>
>> Rev's RE library is based on PCRE, so should be adequately capable.
>>
>> However, I don't think it's as easy to parse the realistic version of 
>> CSV with REs as you might think.
>
>
> Well, Alex, it's not so difficult with Perl.

I have to admit I was under the misapprehension that PCRE meant that it 
was very close to full Perl; I'm not so sure about that now.

> If the items in the comma-separated list can contain other commata, in 
> which case they are enclosed by quotes (optionally otherwise), like 
> '"a,b",c,"d"', then the Perl script to parse the list looks like:
>
> #!/usr/bin/perl
> @s = (
>     '"My family, My PowerBook, My Defender 110","1","mylife at home.com"',
>     'Scrooge,2,billionaire at minimum.com',
>     'RunRev List,"3,4,...","all at the-rest.co.uk"');
> foreach (@s) {
>     if (/"*([^"]+)"*,"*([^"]+)"*,"*([^"]+)"*/) {
>         print ("$_\n\t$1\n\t$2\n\t$3\n");
>     }
> }
>
Yeah - that's a good start. In the "scoring system" I invented last 
night while looking at various "csv" scripts, that's probably a 60% or 
70% solution; it's the remaining 30% that is hard.
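
To make the "60% or 70%" concrete, here is a minimal sketch that ports the quoted pattern to Python's `re` module (Python rather than Perl or Transcript, purely for illustration; the helper name `split3` is hypothetical). It runs the pattern against one of the sample lines and against the doubled-quote line shown further down in this message.

```python
import re

# Same pattern as the quoted Perl script: exactly three fields, each
# optionally wrapped in any number of double quotes.
PATTERN = re.compile(r'"*([^"]+)"*,"*([^"]+)"*,"*([^"]+)"*')

def split3(line):
    """Return the three captured fields, or None if the line does not match."""
    m = PATTERN.search(line)  # unanchored, like Perl's if (/.../)
    return m.groups() if m else None

easy = 'RunRev List,"3,4,...","all at the-rest.co.uk"'
hard = '"""My family"",""My PowerBook"",""My Defender 110""","1","mylife at home.com"'

print(split3(easy))  # ('RunRev List', '3,4,...', 'all at the-rest.co.uk')
print(split3(hard))  # ('My family', 'My PowerBook', 'My Defender 110')
```

The first line splits as intended. The second still "matches", but the pattern treats every run of quotes as a field wrapper, so it reports three pieces of the first field and never reaches the "1" or the address.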

This is NOT a challenge!  If you want to go further because you're 
interested - please do. But don't feel that I'm "challenging" you to do 
so. I have a solution (scripted) that is perfectly adequate in coverage 
(maybe 90% or 95% - certainly not 100%), and more than adequate in speed.

The remaining cases include (but are not limited to):
 - embedded CRs (or newlines, or other line breaks) inside a quoted field
 - embedded quotes, which can be either escaped (preceded by '\') or, 
   more often, doubled ("a field named ""alex"" is here"). Each file 
   should use one or the other; I've never seen both in the same file, 
   though it wouldn't surprise me if some MS product did that.
 - spaces before or after a quoted field (outside the quotes), which 
   may need to be either included or excluded
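
The first two cases above can be handled by a character-by-character scan instead of a single regex. What follows is only a sketch under stated assumptions, written in Python for illustration; it is not the scripted solution mentioned earlier. The function name `parse_csv` and its `escape` parameter are hypothetical. It assumes each file uses one quoting convention, and it keeps any spaces outside the quotes as data, leaving the third case open.

```python
def parse_csv(text, escape=None):
    """Split CSV text into rows of fields (illustrative sketch).

    escape=None  -> embedded quotes are doubled ("")
    escape="\\"  -> embedded quotes are preceded by a backslash
    """
    rows, row, field = [], [], []
    in_quotes = False
    touched = False          # True once the current record has any content
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        touched = True
        if in_quotes:
            if escape and ch == escape and i + 1 < n:
                field.append(text[i + 1])   # escaped character, taken literally
                i += 1
            elif ch == '"':
                if not escape and i + 1 < n and text[i + 1] == '"':
                    field.append('"')       # doubled quote -> one literal quote
                    i += 1
                else:
                    in_quotes = False       # closing quote
            else:
                field.append(ch)            # commas and line breaks are data here
        elif ch == '"':
            in_quotes = True                # opening quote
        elif ch == ',':
            row.append(''.join(field))      # end of field
            field = []
        elif ch == '\n':
            row.append(''.join(field))      # end of record
            rows.append(row)
            row, field = [], []
            touched = False
        elif ch != '\r':                    # drop the CR of a CRLF outside quotes
            field.append(ch)
        i += 1
    if touched:                             # last record had no trailing newline
        row.append(''.join(field))
        rows.append(row)
    return rows

print(parse_csv('"line one\nline two",x\n'))
print(parse_csv('"a \\"b\\" c",d', escape='\\'))
```

Because the scan tracks whether it is inside quotes, a line break or comma inside a quoted field is just more field data, which is the part a single-line pattern cannot express.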

The best example of the embedded quote case is
"""My family"",""My PowerBook"",""My Defender 110""","1","mylife at home.com"

which should (obviously) give
        "My family","My PowerBook","My Defender 110"
        1
        mylife at home.com
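
As a cross-check on that expected output, here is a short sketch that feeds the same line to Python's standard `csv` module (again Python only for illustration; `read_rows` is a hypothetical helper). By default the reader treats a doubled quote inside a quoted field as one literal quote, and its `skipinitialspace` option covers a space between the comma and an opening quote.

```python
import csv
import io

line = '"""My family"",""My PowerBook"",""My Defender 110""","1","mylife at home.com"'

def read_rows(text, **options):
    """Parse CSV text with the standard library reader and return all rows."""
    return list(csv.reader(io.StringIO(text, newline=''), **options))

for field in read_rows(line)[0]:
    print(field)

# A space after the delimiter, before the opening quote:
print(read_rows('a, "b,c"', skipinitialspace=True))
```

The loop prints the three fields listed above, with the first one keeping its inner quotes and commas.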

-- Alex.

