matchText and accented characters

Tue Oct 16 19:59:47 EDT 2007

On Tue, 16 Oct 2007 12:18:54 -0600, Chris Sheffield wrote:

> Thanks, Andres. But that didn't seem to fix the problem. That 
> property, according to the docs, only seems to apply to the numToChar 
> and charToNum functions. I did try it just to make sure.

The issue is that PCRE (the regex library Rev uses) only *optionally* 
supports locales, and I don't know whether any locales were compiled 
into the build that Rev uses. If you know which accented characters 
you're looking for, you can replace them in the pattern with their hex 
equivalents and you'll get a match:

  put matchChunk(fld 1,".*(fianc\x8E).*",tStart,tEnd)

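In this case "\x8E" means "the character with hex code 8E", which is 
decimal 142. That is above the ASCII range; it is the Mac Roman code 
for é (at least on my Mac; other platforms may use a different code). 
To determine this, I ran this code: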

  put baseConvert(charToNum("é"),10,16)

which gave me "8E". So if you know specifically the characters to 
match, you can use this.
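
If you'd rather not look the codes up by hand, the same lookup can be 
wrapped in a small function. This is only a sketch: "hexEscape" is a 
name made up for illustration (not a built-in), and it assumes 
single-byte characters like the ones above.

  function hexEscape pChar
    local tHex
    -- convert the character's code to hex, as in the one-liner above
    put baseConvert(charToNum(pChar),10,16) into tHex
    -- pad to two hex digits so the escape is always "\xHH"
    if the length of tHex < 2 then put "0" before tHex
    return "\x" & tHex
  end hexEscape

You would then build the pattern with it:

  put matchChunk(fld 1,".*(fianc" & hexEscape("é") & ").*",tStart,tEnd)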

On the other hand, if you have a big chunk of text and you don't know 
if there are accented chars or not, I would personally run it the 
"brute force" way: 

1) put a copy of the text into another variable
2) replace the accented chars with their non-accented counterparts - a 
dozen or so lines like:
       - replace "é" with "e" in myVar
       - replace "ó" with "o" in myVar
       - etc.
3) run your 'matchChunk' on the second "clean" variable using 
non-accented text (look for "fiance" and not "fiancé")
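4) if you get a hit, use the startChar/endChar variables from the 
'matchChunk' to extract the text from the *first* variable (the one 
with the accented text). This works because each replace swaps one 
character for one character, so the positions line up in both copies.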
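
Put together, the steps above might look something like this. Treat 
it as a sketch rather than finished code: "matchIgnoringAccents" is a 
hypothetical name, the list of replacements is only a start, and the 
pattern has to contain a parenthesized group so that matchChunk fills 
in the start and end positions.

  function matchIgnoringAccents pText, pPattern
    local tClean, tStart, tEnd
    -- 1) work on a copy so the original text is untouched
    put pText into tClean
    -- 2) one-for-one replacements keep the character positions aligned
    replace "é" with "e" in tClean
    replace "è" with "e" in tClean
    replace "ó" with "o" in tClean
    -- ...add a line for each accented char you expect
    -- 3) match against the "clean" copy using non-accented text
    if matchChunk(tClean, pPattern, tStart, tEnd) then
      -- 4) pull the same range out of the original, accented text
      return char tStart to tEnd of pText
    end if
    return empty
  end matchIgnoringAccents

For example:

  put matchIgnoringAccents(fld 1, "(fiance)")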

Just my 2 cents,

Ken Ray
Sons of Thunder Software, Inc.
Email: kray at sonsothunder.com
Web Site: http://www.sonsothunder.com/