Parse a CSV File with Regular Expressions

Thomas Gutzmann thomas.gutzmann at gutzmann.com
Tue Jan 18 09:34:25 EST 2005


Alex, you nightmare,

> But it fails on a few very common cases, including empty, quoted fields and multiple adjacent 
>quotes within fields
> 
>> d:\Our Documents\Alex> perl re1.pl
>> Original:  """My family"",""My PowerBook"",""My Defender 
>> 110""","","mylife at home.com"
>> After replacement:  "'My family','My PowerBook','My Defender 
>> 110'","","mylife at home.com"
>>     Item 1: 'My family','My PowerBook'
>>     Item 2: 'My Defender 110'
>>     Item 3: ,
>>
>> d:\Our Documents\Alex>
> 
> I'm sure there's a way round this too .... but I suspect it's time to stop drawing out these 
>examples.

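#!/usr/bin/perl
use strict;
use warnings;

my @s = (
     '"""My family"",""My PowerBook"",""My Defender 110""","","mylife at home.com"');
foreach (@s) {
     print ("Original: $_\n");
     # Rewrite escaped quotes inside a field: ""text"" becomes 'text'
     s/""([^"]*)""/'$1'/g;
     print ("After replacement: $_\n");
     # Capture three comma-separated fields; "*" lets a field be empty
     if (/"*([^"]*)"*,"*([^"]*)"*,"*([^"]*)"*/) {
         print ("\tItem 1: $1\n\tItem 2: $2\n\tItem 3: $3\n");
     }
}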

In my previous version I disallowed empty strings by using "+" instead of "*"; using "*", as in
the script above, solves this issue.
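
To make the difference concrete, here is a minimal sketch that runs both variants of the pattern
against a line whose middle field is empty. The helper three_fields and the sample line are made
up for illustration; only the two patterns come from the discussion above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: apply a three-field pattern and return the captures,
# or an empty list if the pattern does not match at all.
sub three_fields {
    my ($line, $re) = @_;
    return $line =~ $re ? ($1, $2, $3) : ();
}

my $line = '"a","","c"';    # made-up sample; the middle field is empty

# "+" demands at least one non-quote character in every field
my $plus = qr/"*([^"]+)"*,"*([^"]+)"*,"*([^"]+)"*/;
# "*" also accepts a field with no characters at all
my $star = qr/"*([^"]*)"*,"*([^"]*)"*,"*([^"]*)"*/;

for my $re ($plus, $star) {
    my @f = three_fields($line, $re);
    my $shown = @f ? join('|', map { "[$_]" } @f) : 'no match';
    print "$shown\n";
}
```

On this sample the "+" pattern finds no match at all, while the "*" pattern prints [a]|[]|[c].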

>> As you can see, embedded newline characters don't affect the result; 
>> this problem must be solved in the routine reading the lines. You can 
>> also ignore EOL by excluding "$" (this is EOL for RE), but I haven't 
>> tested it, and I also don't have the time for it. Normally, you don't 
>> have these problems.
> 
> Actually, normally I do have this problem. Palm Pilot exports usually have embedded CR within 
>quoted fields, and that's one I often deal with.

It depends on whether the embedded CR is an EOL. If not, my example works unchanged. If so, it
takes some more thinking, because you have to identify the true end-of-lines (which are
end-of-records in this case), and you have to cope with missing fields, which would screw up
everything. But I suspect that there is a distinction between end-of-line (e.g. CR) and
end-of-record (e.g. LF or CR/LF): most decent programmers would create some sort of record
boundary, while the embedded CR is used for field formatting.
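
A minimal sketch of the easy case, assuming that LF or CR/LF ends a record and that a bare CR only
ever appears inside a field. The helper split_records and the sample data are hypothetical; I have
not checked this against a real Palm export.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split raw text into records. Assumes LF or CR/LF
# ends a record and a bare CR is field formatting, so it is left alone.
sub split_records {
    my ($text) = @_;
    my @records = split /\r?\n/, $text;
    return @records;
}

# Made-up sample: the second field of the first record contains a bare CR.
my $data = qq{"Alex","line one\rline two"\r\n"Thomas","single line"\r\n};

my $n = 0;
for my $record (split_records($data)) {
    (my $shown = $record) =~ s/\r/<CR>/g;    # make the embedded CR visible
    printf "Record %d: %s\n", ++$n, $shown;
}
```

Each record can then go through the field pattern unchanged, because [^"]* happily matches a CR.
If the CR is itself the record terminator, this split is not enough and the record boundaries have
to be worked out some other way.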

By the way, I can understand your aversion to Perl. But it has its virtues, if you use it
for well-defined and limited purposes, keep programs short and spend enough time on clean
programming and documentation. But who am I telling this...

Cheers,

Thomas G.

