Q for Regexperts
dsc at swcp.com
Fri Oct 17 11:25:34 EDT 2003
On Friday, October 17, 2003, at 07:10 AM, Ivers, Doug E wrote:
> This may be a common need...
> I want to parse text into a list of words.
I don't know how to pull out an arbitrary number of captures in a
Revolution regex. Instead, I use a regex that captures the first word
and the rest of the string after it, then loop on that remainder.
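As a rough illustration of that loop, here is a minimal sketch in Python rather than Revolution. The names FIRST_WORD and split_words are made up for this example, and it assumes a word is simply a run of ASCII letters.

```python
import re

# Capture 1 is the next word; capture 2 is everything after it.
# Leading non-letters are skipped before the word.
FIRST_WORD = re.compile(r"[^A-Za-z]*([A-Za-z]+)(.*)", re.DOTALL)

def split_words(text):
    """Peel one word off the front of the text per iteration."""
    words = []
    rest = text
    while True:
        m = FIRST_WORD.match(rest)
        if m is None:          # no letters left, so we are done
            break
        words.append(m.group(1))
        rest = m.group(2)      # loop on the string after the capture
    return words

print(split_words("This may be a common need..."))
```

The same shape should carry over to any regex engine that returns only a fixed number of captures per call: one capture for the word, one for the remainder.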
For the apostrophe, a simple model would be to assume a word is a
sequence of letters but may include embedded apostrophes.
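One way to write that simple model as a pattern, sketched in Python with ASCII letters only (WORD and words_with_apostrophes are hypothetical names): letters, optionally followed by any number of apostrophe-plus-letters groups, so an apostrophe only counts when it has letters on both sides.

```python
import re

# Letters, then zero or more groups of (apostrophe + letters).
# Leading and trailing apostrophes are therefore not part of the word.
WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

def words_with_apostrophes(text):
    """Return every word, keeping embedded apostrophes."""
    return WORD.findall(text)

print(words_with_apostrophes("I don't like 'quoted' rock'n'roll"))
# ['I', "don't", 'like', 'quoted', "rock'n'roll"]
```

Note that under this model a trailing possessive apostrophe, as in "dogs'", is dropped from the word.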
> Is there any reason why the same regex won't work with European
> languages? Do other languages have characters like the English
> apostrophe that should be considered part of a word?
You can look at the PCRE doc web page (start half-way down) and look at
the definition of \w, the "word" character matcher. This will match
some high characters if the locale is set right. However, it will
also match the underscore. You might be better off working out exactly
which characters you want to match in a particular encoding and
matching them with \xhh. (PCRE has
a very limited UTF-8 mode, but we don't seem to have a way to turn that
on; I'd prefer full-width unicode mode when it comes, anyway.)
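To make the \xhh suggestion concrete, here is a sketch in Python that assumes the text is encoded as ISO 8859-1 (Latin-1), where bytes C0 through FF are letters except D7 (multiplication sign) and F7 (division sign). The names LETTER, LATIN1_WORD and latin1_words are made up for this example; another encoding would need different ranges.

```python
import re

# Latin-1 letters spelled out as byte ranges: ASCII letters plus
# C0-D6, D8-F6 and F8-FF. Unlike \w, this leaves out digits and
# the underscore.
LETTER = rb"A-Za-z\xC0-\xD6\xD8-\xF6\xF8-\xFF"

# Same "letters with embedded apostrophes" model, over raw bytes.
LATIN1_WORD = re.compile(rb"[" + LETTER + rb"]+(?:'[" + LETTER + rb"]+)*")

def latin1_words(text):
    """Split text into words by matching on its Latin-1 bytes."""
    data = text.encode("latin-1")
    return [w.decode("latin-1") for w in LATIN1_WORD.findall(data)]

print(latin1_words("Müller's café_bar, s'il vous plaît"))
```

Because the character class is explicit, the result does not depend on the locale setting.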