Parse a CSV File with Regular Expressions

Thomas Gutzmann thomas.gutzmann at gutzmann.com
Tue Jan 18 08:44:47 EST 2005


Hi Alex,

A 100% solution is not possible with a single RE, because embedded quotes cannot be converted in 
the same pass that parses the rest of the line.

> The best example of the embedded quote case is
> """My family"",""My PowerBook"",""My Defender 110""","1","mylife at home.com"

I have modified the Perl script to convert doubled double quotes (""x"") to single quotes ('x'). 
It's just one way to do it, and of course I'm using a regular expression for that:

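#!/usr/bin/perl
use strict;
use warnings;

# Test records: quoted, unquoted, embedded newlines, embedded ("") quotes.
my @s = (
	'"My family, My PowerBook, My Defender 110","1","mylife at home.com"',
	'Scrooge,2,billionaire at minimum.com',
	'RunRev List,"3,
	4,
	...","all at the-rest.co.uk"',
	' """My family"",""My PowerBook"",""My Defender 110""","1","mylife at home.com"');
foreach (@s) {
	print "Original: $_\n";
	# Step 1: turn each doubled-quote pair ""x"" into 'x', so that the
	# field pattern below never meets a quote inside a field.
	s/""([^"]*)""/'$1'/g;
	print "After replacement: $_\n";
	# Step 2: match three comma-separated fields, each optionally quoted.
	if (/"*([^"]+)"*,"*([^"]+)"*,"*([^"]+)"*/) {
		print "\tItem 1: $1\n\tItem 2: $2\n\tItem 3: $3\n";
	}
}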

The result is:

Original: "My family, My PowerBook, My Defender 110","1","mylife at home.com"
After replacement: "My family, My PowerBook, My Defender 110","1","mylife at home.com"
         Item 1: My family, My PowerBook, My Defender 110
         Item 2: 1
         Item 3: mylife at home.com
Original: Scrooge,2,billionaire at minimum.com
After replacement: Scrooge,2,billionaire at minimum.com
         Item 1: Scrooge
         Item 2: 2
         Item 3: billionaire at minimum.com
Original: RunRev List,"3,
         4,
         ...","all at the-rest.co.uk"
After replacement: RunRev List,"3,
         4,
         ...","all at the-rest.co.uk"
         Item 1: RunRev List
         Item 2: 3,
         4,
         ...
         Item 3: all at the-rest.co.uk
Original:  """My family"",""My PowerBook"",""My Defender 110""","1","mylife at home.com"
After replacement:  "'My family','My PowerBook','My Defender 110'","1","mylife at home.com"
         Item 1: 'My family','My PowerBook','My Defender 110'
         Item 2: 1
         Item 3: mylife at home.com

As you can see, embedded newline characters don't affect the match, because [^"]+ also matches 
newlines. Reading a record that spans several lines is a separate problem and must be solved in 
the routine that reads the lines. You could also ignore EOL by excluding "$" (the end-of-line 
anchor in REs), but I haven't tested this and don't have the time to. Normally, you don't have 
these problems.
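
A minimal sketch of such a reading routine follows. It assumes that a newline can only occur 
inside a quoted field, so a record is complete as soon as it contains an even number of double 
quotes (a doubled "" escape adds two and keeps the count balanced). The helper name read_record 
and the sample data are made up for illustration; this is untested against real-world files.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the next logical CSV record from a filehandle,
# or undef at end of input. A record is considered complete once it holds an
# even number of double quotes, i.e. no quoted field is still open.
sub read_record {
    my ($fh) = @_;
    my $record = <$fh>;
    return unless defined $record;
    # tr/"// returns the number of quotes and leaves the string unchanged.
    while (($record =~ tr/"//) % 2) {
        my $more = <$fh>;
        last unless defined $more;    # unbalanced quotes at EOF: give up
        $record .= $more;
    }
    chomp $record;
    return $record;
}

# Demo: an in-memory file with two records, the first spanning three lines.
my $data = qq{RunRev List,"3,\n4,\n...","all at the-rest.co.uk"\n}
         . qq{Scrooge,2,billionaire at minimum.com\n};
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
while (defined(my $rec = read_record($fh))) {
    print "Record: [$rec]\n";
}
close $fh;
```

Each record returned this way can then be passed to the replacement and matching steps of the 
script above.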

>> Most of us use hands and feet for their respective purposes. So why do 
>> programmers want to use one tool for all? 
> 
> Because there's a level of inefficiency and discomfort caused by frequent changes in language 
> and tools. Because it's hard to become an expert in one language - doing it in Rev and Perl and 
> PHP and Python and Java and .... is probably impossible. Because it's easy, but wrong, to write 
> one language using the style and tricks of another (see various blog threads about "Python's not 
> Java", etc.)
> 
> But mostly just because programmers are people :-)

Well, I don't agree. A good programmer should master a whole box of tools, and I also expect good 
developers to be multilingual. One of the problems we have in IT today comes from the fact that 
too many people learn just one language (Java) and just the basics of database systems (primitive 
SQL à la MySQL, no procedural SQL), and that they are also limited in their knowledge of tools.

From a philosophical point of view, only knowledge gives you the possibility to choose, to 
differentiate, and to understand. This, in short, is one of the most important aspects of free 
will - in public and private life as well as in the job. An old saying in Germany goes: 
"Knowledge gives you freedom" ("Wissen macht frei").

But this discussion doesn't belong here.

Cheers,

Thomas G.

