How to find offsets in Unicode Text fast

Mark Waddingham mark at livecode.com
Tue Nov 13 03:26:55 EST 2018


On 2018-11-13 01:06, Geoff Canyon via use-livecode wrote:
> On Mon, Nov 12, 2018 at 11:36 AM Ben Rubinstein via use-livecode <
> use-livecode at lists.runrev.com> wrote:
> 
>> I'm really confused that case-insensitive should work at all for
>> UTF-16 or UTF-32;

The caseSensitive (and formSensitive) properties apply only to strings, 
*not* to binary strings.

The output of textEncode() is a binary string.

The 'is' operator is overloaded - in strict order:

   left-empty 'is' right-ANY -- returns is-empty(right-ANY)
   left-ANY 'is' right-empty -- returns is-empty(left-ANY)
   left-array 'is' right-array -- compare as array
   left-number 'is' right-number -- compare as number
   left-numeric-[binary]-string 'is' right-numeric-[binary]-string
      -- compare as number
   left-binary-string 'is' right-binary-string -- compare as binary strings
   left-ANY 'is' right-ANY -- compare as strings

Concatenation, 'put after' and 'put before' are also overloaded:

    binary-string & binary-string -> binary-string
    string & ANY -> string
    ANY & string -> string

    put src-data after|before dst-data -> dst-data is binary-string
    put src-ANY after|before dst-ANY -> dst-ANY is string
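
A rough way to see these rules in action is to ask the engine what kind 
of value each expression produces. This is only a sketch: it assumes the 
'is strictly' operator available in LiveCode 7 and later, and the 
handler and variable names are purely illustrative.

    on mouseUp
       local tText, tData, tReport
       put "Ѡ" into tText
       put textEncode(tText, "UTF-32") into tData
       -- textEncode() output should report as a binary string
       put "encoded is binary:" && \
             (tData is strictly a binary string) into tReport
       -- binary-string & binary-string should stay a binary string
       put return & "binary & binary is binary:" && \
             ((tData & tData) is strictly a binary string) after tReport
       -- string & ANY should come out as a string
       put return & "string & binary is string:" && \
             ((tText & tData) is strictly a string) after tReport
       put tReport
    end mouseUp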

> This is so puzzling. I tried this code in a button:
> 
> on mouseUp
>    put "Ѡ" into x
>    put "ѡ" into y
>    --put ("Ѡ" is "ѡ") && (x is y)
>    --exit mouseUp
>    put textencode("Ѡ","UTF-32") into xBig
>    put textencode("ѡ","UTF-32") into xSmall
>    repeat for each byte B in xBig
>       put B after yBig
>    end repeat
>    repeat for each byte B in xSmall
>       put B after ySmall
>    end repeat
>    put "Ѡ" into zBig
>    put "ѡ" into zSmall
>    put zBig into wBig
>    put zSmall into wSmall
>    put textencode(zBig,"UTF-32") into zBig
>    put textencode(zSmall,"UTF-32") into zSmall
>    put x into j
>    put y into k
>    set caseSensitive to false
>    put ("Ѡ" is "ѡ") && (xBig is xSmall) && (yBig is ySmall) && \
>          (zBig is zSmall) && (wBig is wSmall) && (x is y) && (j is k)
> end mouseUp
> 
> 
> That puts: true false false false true true true
> 
> Things to note:
> 
> 1. "Ѡ" and "ѡ" are upper and lower case omega in Cyrillic, 00000460
> and 00000461. Given the string literals, LC is happy to say they are
> the same (the first true)
> 2. Put them in a variable, LC is happy to say they are the same
> (the second-to-last true).
> 3. Convert them to UTF-32 and LC no longer recognizes them as the same
> (the fourth boolean, false)
> 4. Put the variables into other variables, and LC identifies them as
> the same (the last true)

("Ѡ" is "ѡ") is true because both sides are strings
(xBig is xSmall) is false because both sides are binary strings (and so 
are compared byte for byte)
(yBig is ySmall) is false because both sides are binary strings (built 
up from bytes with 'put after')
(zBig is zSmall) is false because textEncode() returns a binary string, 
so both sides are binary strings
(wBig is wSmall) is true because both sides are strings
(x is y) is true because both sides are strings
(j is k) is true because both sides are strings
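
If a caseless comparison of encoded data is what is wanted, one option 
is to decode back to a string before comparing. A minimal sketch, 
assuming both values hold valid UTF-32 (the handler is illustrative 
only):

    on mouseUp
       local xBig, xSmall
       put textEncode("Ѡ", "UTF-32") into xBig
       put textEncode("ѡ", "UTF-32") into xSmall
       set the caseSensitive to false
       -- textDecode() turns the binary strings back into strings,
       -- so 'is' compares them as text rather than byte for byte
       put textDecode(xBig, "UTF-32") is textDecode(xSmall, "UTF-32")
    end mouseUp

Since both sides are strings again, the comparison should honour 
caseSensitive in the same way as the ("Ѡ" is "ѡ") case.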

One could argue that 'is'/'is not' should never have been overloaded to 
do binary string comparison, and that it should perhaps have been added 
as a separate operator (especially since binary strings are compared as 
numbers if they are numeric). With hindsight I'd probably agree, as it is 
a slight discontinuity in comparison behaviour relative to pre-7.

Indeed, had we not added that overload we would not be having this 
discussion. Instead it would be the discussion that used to come up a 
lot about comparing the output of compress() and other functions which 
have always produced binary data, and why those comparisons seemed 'not 
as one would expect'.

Warmest Regards,

Mark.

-- 
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps



