Parse a CSV File with Regular Expressions

Thomas Gutzmann thomas.gutzmann at gutzmann.com
Tue Jan 18 03:32:31 EST 2005


On Tue, 18 Jan 2005 01:36:48 +0000
  Alex Tweedly <alex at tweedly.net> wrote:
> Rev's RE library is based on PCRE, so should be adequately capable.
> 
> However, I don't think it's as easy to parse the realistic version of CSV with REs as you
> might think.

Well, Alex, it's not so difficult with Perl. If the items in the comma-separated list can contain 
other commas, in which case they are enclosed in quotes (the quotes are optional otherwise), like 
'"a,b",c,"d"', then a Perl script to parse a list of three items looks like:

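#!/usr/bin/perl
use strict;
use warnings;

# Sample lines; items that contain commas are enclosed in quotes.
my @s = (
	'"My family, My PowerBook, My Defender 110","1","mylife at home.com"',
	'Scrooge,2,billionaire at minimum.com',
	'RunRev List,"3,4,...","all at the-rest.co.uk"');
foreach (@s) {
	# Three items, each optionally wrapped in quotes, separated by commas.
	if (/"*([^"]+)"*,"*([^"]+)"*,"*([^"]+)"*/) {
		print "$_\n\t$1\n\t$2\n\t$3\n";
	}
}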

This example gives the result:

"My family, My PowerBook, My Defender 110","1","mylife at home.com"
         My family, My PowerBook, My Defender 110
         1
         mylife at home.com
Scrooge,2,billionaire at minimum.com
         Scrooge
         2
         billionaire at minimum.com
RunRev List,"3,4,...","all at the-rest.co.uk"
         RunRev List
         3,4,...
         all at the-rest.co.uk

which is what you would expect.
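
The expression above is tied to exactly three items per line. A rough sketch of how the same idea 
might be extended to any number of items is to match one item at a time in a loop. The helper 
name split_csv_line is made up for illustration, and the sketch assumes that quoted items never 
contain doubled quotes ("") or line breaks; for real-world CSV a dedicated module such as 
Text::CSV from CPAN is probably the safer choice.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split one line into its items.
# Assumption: no doubled quotes ("") and no line breaks inside quoted items.
sub split_csv_line {
    my ($line) = @_;
    my @fields;
    # \G continues where the previous match stopped. Each pass consumes one
    # item (either "quoted text" or bare text without commas or quotes)
    # plus the separator that follows it: a comma or the end of the string.
    while ($line =~ /\G(?:"([^"]*)"|([^",]*))(,|\z)/g) {
        push @fields, defined $1 ? $1 : $2;
        last if $3 eq '';    # separator was the end of the string: done
    }
    return @fields;
}

for my $line ('Scrooge,2,billionaire at minimum.com',
              'RunRev List,"3,4,...","all at the-rest.co.uk"',
              'a,,c,') {
    print join('|', split_csv_line($line)), "\n";
}
```

Capturing the separator is a deliberate choice here: it lets the loop stop right after the last 
item instead of picking up an extra empty match at the end of the string. A malformed line (for 
example a stray quote in the middle of an item) simply stops the loop early, so a caller that 
cares would have to check for that.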

I don't know if it works in Rev, because every implementation of RE is a bit different, and Perl 
has the best I've come across. Anyway, Perl can be installed on practically every machine, and it 
comes pre-installed on Unix, Linux and Mac OS X. So just use the power of this language in 
combination with Rev, RB or whatever development tool you use, instead of trying to do everything 
with one tool.

I miss this flexibility in the use of tools in the IT world. Nobody in industry would 
use a Porsche to transport stones (except the ones worn around the neck or wherever ladies have 
them), and nobody would drive a fork-lift truck on a (German) Autobahn. Most of us use hands and 
feet for their respective purposes. So why do programmers want to use one tool for everything?

Cheers,

Thomas G.

---

For those of you who find it hard to read regular expressions (they are a good example of a 
write-only language):

/"*([^"]+)"*,"*([^"]+)"*,"*([^"]+)"*/

represents the same group three times, separated by commas: "*([^"]+)"*

This expression contains a prefix and a suffix: "* - which means "zero or more quotes".

In the middle of the expression - enclosed in parentheses - is the term to be extracted: [^"]+ - 
which reads: any character except a quote, at least once. If you replace the "+" with a "*", the 
term may also be empty, so two commas are allowed to follow each other directly.
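
To make the difference between "+" and "*" concrete, here is a small sketch that runs both 
variants against a line with an empty middle item. The helper name three_fields and the test line 
'a,,c' are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: apply the three-item expression to a line.
# With $allow_empty set, each "+" is replaced by "*".
sub three_fields {
    my ($line, $allow_empty) = @_;
    my $re = $allow_empty
        ? qr/"*([^"]*)"*,"*([^"]*)"*,"*([^"]*)"*/
        : qr/"*([^"]+)"*,"*([^"]+)"*,"*([^"]+)"*/;
    # Return the three captured items, or an empty list if nothing matched.
    return $line =~ $re ? ($1, $2, $3) : ();
}

for my $allow_empty (0, 1) {
    my @f      = three_fields('a,,c', $allow_empty);
    my $label  = $allow_empty ? '* variant' : '+ variant';
    my $result = @f ? join('|', @f) : 'no match';
    print "$label: $result\n";
}
```

The "+" variant should report no match for 'a,,c', because every item needs at least one 
character, while the "*" variant should return a, an empty string and c.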

The regular expression can be shortened even more, but then it becomes completely 
incomprehensible, and you need more time to comment it than to write it.
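
Going the other way is also possible: Perl's /x modifier allows whitespace and comments inside an 
expression, and qr// lets the repeated group be written only once. The following is just a 
sketch of that idea; the variable names $field and $three are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One item: optional quotes around a run of characters that are not quotes.
my $field = qr{
    "*          # zero or more quotes before the item
    ([^"]+)     # the item itself: one or more characters that are not quotes
    "*          # zero or more quotes after the item
}x;

# Three such items separated by commas, as in the expression above.
# The capture groups are numbered in order, so they end up in $1, $2, $3.
my $three = qr/$field,$field,$field/;

if ('Scrooge,2,billionaire at minimum.com' =~ $three) {
    print "$1\n$2\n$3\n";
}
```

This is longer than the one-line version, but the comments live inside the expression, so they 
are harder to lose when the pattern changes.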


More information about the use-livecode mailing list