Q for Regexperts

Dar Scott dsc at swcp.com
Fri Oct 17 11:25:34 EDT 2003


On Friday, October 17, 2003, at 07:10 AM, Ivers, Doug E wrote:

> This may be a common need...
>
> I want to parse text into a list of words.

I don't know how to pull out an arbitrary number of captures in a 
Revolution regex.  Instead, I use a regex that captures the first word 
and the rest of the string after it, and then I loop on the remainder 
until nothing more matches.
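
To make the looping idea concrete, here is a minimal sketch in Python 
rather than Revolution.  The pattern, the function name 
words_by_looping, and the letters-only definition of a word are 
assumptions for illustration, not something from the original question.

```python
import re

# Capture 1: the next run of ASCII letters (after skipping non-letters).
# Capture 2: everything that follows it, to be fed back into the loop.
FIRST_WORD = re.compile(r"[^A-Za-z]*([A-Za-z]+)(.*)", re.DOTALL)

def words_by_looping(text):
    """Collect words by repeatedly matching the first word and the rest."""
    words = []
    rest = text
    while True:
        m = FIRST_WORD.match(rest)
        if m is None:
            # No letters left in the remainder, so we are done.
            break
        words.append(m.group(1))
        rest = m.group(2)
    return words
```

Each pass peels one word off the front, so the number of captures in 
the regex stays fixed at two no matter how many words the text holds.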

For the apostrophe, a simple model would be to assume a word is a 
sequence of letters but may include embedded apostrophes.
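
That simple model might be written as a pattern like the one below.  
This is a sketch in Python, which can return all matches in one call; 
the names WORD and find_words are made up for the example, and only 
ASCII letters and the plain ' character are assumed.

```python
import re

# One or more letters, optionally followed by groups of an apostrophe
# plus more letters.  The apostrophe must be embedded: it cannot start
# or end the word.
WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

def find_words(text):
    """Return every word in text under the letters-plus-apostrophe model."""
    return WORD.findall(text)
```

With this pattern, "Don't" and "dog's" stay whole, while quote marks 
wrapped around a word, as in 'bone', are left out of the match.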
>
> Is there any reason why the same regex won't work with European 
> languages?  Do other languages have characters like the English 
> apostrophe that should be considered part of a word?

You can look at the PCRE doc web page (start half-way down) and look at 
the definition of \w, the "word" character matcher.  It will match 
some high (non-ASCII) characters if the locale is set right.  However, 
it also matches the underscore.  You might be better off deciding 
exactly which characters you want to match in a particular encoding and 
matching them with \xhh.  (PCRE has a very limited UTF-8 mode, but we 
don't seem to have a way to turn that on; I'd prefer a full-width 
Unicode mode when it comes, anyway.)
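
As an illustration of the \xhh approach, here is a sketch in Python 
that works on raw bytes.  It assumes the text is encoded as ISO 8859-1 
(Latin-1); the byte ranges and the function name latin1_words are 
choices made for this example, and a different encoding would need 
different ranges.

```python
import re

# ASCII letters plus the accented letters of ISO 8859-1.
# 0xD7 (multiplication sign) and 0xF7 (division sign) are skipped
# because they are not letters.
LATIN1_WORD = re.compile(rb"[A-Za-z\xC0-\xD6\xD8-\xF6\xF8-\xFF]+")

def latin1_words(data):
    """Return the words found in Latin-1 encoded bytes, decoded to str."""
    return [w.decode("latin-1") for w in LATIN1_WORD.findall(data)]

# For comparison: on bytes, Python's \w covers only ASCII letters,
# digits and the underscore unless a locale flag is given, so it both
# splits accented words and keeps the underscore.
def ascii_w_words(data):
    return re.findall(rb"\w+", data)
```

Spelling out the class this way keeps the underscore out and makes the 
result independent of whatever locale happens to be set.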

Dar Scott




