How to find offsets in Unicode Text fast

Ben Rubinstein benr_mc at cogapp.com
Mon Nov 12 14:36:22 EST 2018


Coming late to this discussion. Very excited by this approach of converting 
everything to UTF-32 in order to do fast offsets.

I'm really confused that case-insensitive matching should work at all for 
UTF-16 or UTF-32; at this point, as far as I understand it, LC has no idea how 
to correctly interpret the value of the variable as text. Or at least, I'd 
expect it to work for some things - e.g. A/a, which are the same as single 
bytes; and _also_ for Å/å, because those are also equivalently 'single byte' - 
0xC5 and 0xE5; but not for e.g. Ă/ă, which are 0x0102 and 0x0103, where I 
wouldn't expect 0x03 to be considered a case-shifted version of 0x02. All this 
just proves that I don't understand what the new(ish) engine is doing with 
strings. I'm going to start a new thread to explore this.
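
To make the byte values above concrete, here is a small sketch in Python (not 
LiveCode, and it says nothing about what the LC engine does internally). It 
just prints the UTF-16 big-endian code units for each pair; the helper name 
utf16_hex is made up for this illustration.

```python
# Illustration only: show the UTF-16 (big-endian) code units for each
# upper/lower pair discussed above. utf16_hex is a hypothetical helper.
def utf16_hex(ch: str) -> str:
    return ch.encode("utf-16-be").hex()

# A/a, Å/å (U+00C5/U+00E5), Ă/ă (U+0102/U+0103)
pairs = [("A", "a"), ("\u00c5", "\u00e5"), ("\u0102", "\u0103")]

for upper, lower in pairs:
    # The first two pairs differ by 0x20 in the low byte; the third differs by 1,
    # so a single byte-level rule would not cover all three.
    print(upper, utf16_hex(upper), lower, utf16_hex(lower),
          "difference:", hex(ord(lower) - ord(upper)))
```

The point is only that the "case distance" is not the same for every pair, so 
any folding that works on all three has to know it is looking at text.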

In the meantime I'd be suspicious about doing a case-insensitive search in 
this way; but my guess would be that, if your use-case will accept 
case-sensitivity, it would be safer (and faster?) to use byteOffset on the 
UTF-32 data rather than offset.

Mr Very Picky would also suggest that to be really correct, the code in this 
case should also check that the offset found was on a four-byte boundary (tPos 
mod 4 = 1); it's probably a purely theoretical consideration, but I think that 
the four-byte sequence (representing the character you're searching for) could 
in theory be incorrectly matched across two other characters.
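
A rough sketch of that boundary check, again in Python rather than LiveCode, 
and assuming big-endian UTF-32 with no byte-order mark (the byte order the LC 
engine actually uses may differ). The function name aligned_offsets is 
invented for the example. Python byte positions are 0-based, so the test is 
pos % 4 == 0, which corresponds to tPos mod 4 = 1 with 1-based offsets.

```python
# Sketch: case-sensitive search on UTF-32 bytes, keeping only matches that
# start on a four-byte boundary. Assumes big-endian UTF-32 without a BOM.
def aligned_offsets(needle: str, haystack: str) -> list[int]:
    n = needle.encode("utf-32-be")
    h = haystack.encode("utf-32-be")
    found = []
    pos = h.find(n)
    while pos != -1:
        if pos % 4 == 0:                 # same idea as tPos mod 4 = 1 (1-based)
            found.append(pos // 4 + 1)   # report a 1-based character position
        pos = h.find(n, pos + 1)         # keep scanning, allowing overlaps
    return found

# Brian's test string: "åå" within "aaååÅÅååaa", case-sensitive.
print(aligned_offsets("\u00e5\u00e5", "aa\u00e5\u00e5\u00c5\u00c5\u00e5\u00e5aa"))

# A cross-character false match: U+4100 encodes as 00 00 41 00, which also
# appears at byte 1 of "AB" (00 00 00 41 00 00 00 42).
raw = "AB".encode("utf-32-be").find("\u4100".encode("utf-32-be"))
print(raw, aligned_offsets("\u4100", "AB"))
```

So the misaligned match is not purely theoretical, at least for this byte 
order: an unchecked byte search finds U+4100 inside "AB", and the boundary 
check rejects it.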

On 12/11/2018 05:00, Brian Milby via use-livecode wrote:
> I just tried one additional test.  Search for "åå" within "aaååÅÅååaa".
> (On a Mac keyboard, the characters are made with A, Option-A, and
> Shift-Option-A.)  The Offset UTF16 version does not return the correct
> result if case sensitive is false (returns the same value as if it were
> true: 3,7).  Every other version correctly performs the case folding
> (3,4,5,6,7).



