How to find offsets in Unicode Text fast

Geoff Canyon gcanyon at gmail.com
Mon Nov 12 19:06:09 EST 2018


On Mon, Nov 12, 2018 at 11:36 AM Ben Rubinstein via use-livecode <
use-livecode at lists.runrev.com> wrote:

> I'm really confused that case-insensitive should work at all for UTF-16 or
> UTF-32;


This is so puzzling. I tried this code in a button:

on mouseUp
   put "Ѡ" into x
   put "ѡ" into y
   --put ("Ѡ" is "ѡ") && (x is y)
   --exit mouseUp
   put textencode("Ѡ","UTF-32") into xBig
   put textencode("ѡ","UTF-32") into xSmall
   repeat for each byte B in xBig
      put B after yBig
   end repeat
   repeat for each byte B in xSmall
      put B after ySmall
   end repeat
   put "Ѡ" into zBig
   put "ѡ" into zSmall
   put zBig into wBig
   put zSmall into wSmall
   put textencode(zBig,"UTF-32") into zBig
   put textencode(zSmall,"UTF-32") into zSmall
   put x into j
   put y into k
   set caseSensitive to false
   put ("Ѡ" is "ѡ") && (xBig is xSmall) && (yBig is ySmall) && \
         (zBig is zSmall) && (wBig is wSmall) && (x is y) && (j is k)
end mouseUp


That puts: true false false false true true true

Things to note:

1. "Ѡ" and "ѡ" are upper- and lowercase omega in Cyrillic, U+0460 and U+0461
(00000460 and 00000461 in UTF-32). Given the string literals, LC is happy to
say they are the same (the first true).
2. Put them in a variable, LC is happy to say they are the same
(the second-to-last true).
3. Convert them to UTF-32 and LC no longer recognizes them as the same (the
second through fourth booleans, all false).
4. Put the variables into other variables, and LC identifies them as the
same (the last true)
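
One way to dig further, as a rough sketch only: dump the bytes of the two
encoded values, then decode them back to text before comparing. This rests on
a guess that is not confirmed anywhere above, namely that textEncode returns
binary data and that binary values are compared byte for byte regardless of
caseSensitive. The variable names (tBig, tSmall, tBigBytes, tSmallBytes,
tSame) are made up for the illustration.

```livecode
on mouseUp
   -- assumption: textEncode returns binary data, so inspect the raw bytes
   put textEncode("Ѡ","UTF-32") into tBig
   put textEncode("ѡ","UTF-32") into tSmall
   put empty into tBigBytes
   put empty into tSmallBytes
   repeat for each byte B in tBig
      put byteToNum(B) & space after tBigBytes
   end repeat
   repeat for each byte B in tSmall
      put byteToNum(B) & space after tSmallBytes
   end repeat
   set caseSensitive to false
   -- decode back to text first, so the comparison is made on text
   put (textDecode(tBig,"UTF-32") is textDecode(tSmall,"UTF-32")) into tSame
   -- show both byte lists and the result of the decoded comparison
   put tBigBytes & return & tSmallBytes & return & tSame
end mouseUp
```

If the guess holds, the two byte lists should differ in a single byte (96
versus 97), and the decoded comparison should behave like the plain-text
comparisons in the original script.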

gc


