How to find offsets in Unicode Text fast
Geoff Canyon
gcanyon at gmail.com
Mon Nov 12 19:06:09 EST 2018
On Mon, Nov 12, 2018 at 11:36 AM Ben Rubinstein via use-livecode <
use-livecode at lists.runrev.com> wrote:
> I'm really confused that case-insensitive should work at all for UTF-16 or
> UTF-32;
This is so puzzling. I tried this code in a button:
on mouseUp
   put "Ѡ" into x
   put "ѡ" into y
   --put ("Ѡ" is "ѡ") && (x is y)
   --exit mouseUp
   -- encode the string literals directly
   put textencode("Ѡ","UTF-32") into xBig
   put textencode("ѡ","UTF-32") into xSmall
   -- rebuild the encoded data one byte at a time
   repeat for each byte B in xBig
      put B after yBig
   end repeat
   repeat for each byte B in xSmall
      put B after ySmall
   end repeat
   put "Ѡ" into zBig
   put "ѡ" into zSmall
   -- keep unencoded copies before zBig and zSmall are overwritten
   put zBig into wBig
   put zSmall into wSmall
   put textencode(zBig,"UTF-32") into zBig
   put textencode(zSmall,"UTF-32") into zSmall
   put x into j
   put y into k
   set caseSensitive to false
   put ("Ѡ" is "ѡ") && (xBig is xSmall) && (yBig is ySmall) && \
         (zBig is zSmall) && (wBig is wSmall) && (x is y) && (j is k)
end mouseUp
That puts: true false false false true true true
Things to note:
1. "Ѡ" and "ѡ" are upper- and lowercase omega in Cyrillic, code points
00000460 and 00000461 (U+0460 and U+0461). Given the string literals, LC
is happy to say they are the same (the first true).
2. Put them in variables, and LC is still happy to say they are the same
(the second-to-last true).
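3. Convert them to UTF-32 and LC no longer recognizes them as the same (the
fourth boolean, false; the second and third, which compare the encoded
literals and their byte-by-byte copies, are false as well)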
4. Put the variables into other variables, and LC identifies them as the
same (the last true)
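
One way to read these results is that textencode returns binary data rather than text, so "is" compares it byte for byte and caseSensitive has no effect. A minimal sketch of that idea follows, assuming that textDecode with "UTF-32" reverses the encoding and gives back ordinary text. The handler and the t-prefixed variable names are made up for illustration; this is not a confirmed explanation of the engine's behavior.

```livecode
on mouseUp
   set the caseSensitive to false
   put textEncode("Ѡ","UTF-32") into tBigData
   put textEncode("ѡ","UTF-32") into tSmallData
   -- compare the encoded data as-is (assumed to be a byte-wise comparison)
   put (tBigData is tSmallData) into tRawResult
   -- decode back to text first, then compare
   -- (assumption: the decoded values behave like the original strings)
   put textDecode(tBigData,"UTF-32") into tBigText
   put textDecode(tSmallData,"UTF-32") into tSmallText
   put (tBigText is tSmallText) into tTextResult
   put tRawResult && tTextResult
end mouseUp
```

If that reading is right, the first value should match the false results above, and any case-insensitive matching would have to be done on the decoded text rather than on the encoded bytes.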
gc