How to find offsets in Unicode Text fast
Ben Rubinstein
benr_mc at cogapp.com
Mon Nov 12 14:36:22 EST 2018
Coming late to this discussion. Very excited by this approach of converting
everything to UTF-32 in order to do fast offsets.
I'm really confused that case-insensitive should work at all for UTF-16 or
UTF-32; at this point, as far as I understand it, LC has no idea how to
correctly interpret the value of the variable as text. Or at least, I'd expect
it to work for some things - e.g. A/a which are the same as single bytes; and
_also_ for Å/å because those are also equivalently 'single byte' - 0xC5 and
0xE5; but not for e.g. Ă/ă which are 0x0102 and 0x0103, where I wouldn't
expect 0x03 to be considered as a case-shifted version of 0x02. All this just
proves that I don't understand what the new(ish) engine is doing with strings.
I'm going to start a new thread to explore this.
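
To make the byte values above concrete, here is a small sketch in Python (not
LiveCode; the helper name code_units is made up for illustration, and
big-endian byte order is an assumption chosen only so the hex reads naturally).
It just prints the encoded bytes of each character; it says nothing about what
the LC engine does internally.

```python
def code_units(ch: str, encoding: str = "utf-16-be") -> str:
    """Return the encoded bytes of a single character as a hex string."""
    return ch.encode(encoding).hex()

# Escapes are used so the characters are certainly the precomposed forms.
pairs = [("A", "a"), ("\u00c5", "\u00e5"), ("\u0102", "\u0103")]

for upper, lower in pairs:
    # In UTF-16 the first two pairs differ only in the low byte by 0x20;
    # the last pair (U+0102 / U+0103) differs by 0x01 instead.
    print(upper, code_units(upper), lower, code_units(lower))
```

In UTF-32 the same values are simply padded with two more zero bytes, so the
question of how a case-insensitive comparison could succeed on the raw data is
the same for both encodings.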
In the meantime I'd be suspicious about doing a case-insensitive search in
this way; but my guess would be that, if your use-case will accept
case-sensitivity, it would be safer (and faster?) to use byteOffset on the
UTF-32 data rather than offset.
Mr Very Picky would also suggest that to be really correct, the code in this
case should also check that the offset found was on a four-byte boundary (tPos
mod 4 = 1); it's probably a purely theoretical consideration, but I think that
the four-byte sequence (representing the character you're searching for) could
in theory be incorrectly matched across two other characters.
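
A minimal sketch of that boundary problem, again in Python rather than LiveCode
and with assumed names (aligned_find is hypothetical). It assumes big-endian
UTF-32; with little-endian data the same thing can happen at a different
offset. Python byte indexes are 0-based, so the check here is pos % 4 == 0,
which corresponds to tPos mod 4 = 1 with 1-based offsets.

```python
def aligned_find(haystack: bytes, needle: bytes, start: int = 0) -> int:
    """Return the 0-based index of the first match of needle that begins on a
    four-byte boundary, or -1 if there is none."""
    pos = haystack.find(needle, start)
    while pos != -1 and pos % 4 != 0:
        # Misaligned hit: it straddles two characters, so keep looking.
        pos = haystack.find(needle, pos + 1)
    return pos

# U+01F6 followed by "a" encodes as 00 00 01 F6 | 00 00 00 61.
text = "\u01f6a".encode("utf-32-be")
# U+1F600 encodes as 00 01 F6 00, which appears at byte index 1 above
# even though the text does not contain that character.
needle = "\U0001f600".encode("utf-32-be")

naive = text.find(needle)             # matches across the two characters
aligned = aligned_find(text, needle)  # rejects the misaligned match
print(naive, aligned)
```

So the false match is not purely theoretical, although it needs a fairly
contrived pair of characters to trigger it.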
On 12/11/2018 05:00, Brian Milby via use-livecode wrote:
> I just tried one additional test. Search for "åå" within "aaååÅÅååaa".
> (On a Mac keyboard, the characters are made with A, Option-A, and
> Shift-Option-A.) The Offset UTF16 version does not return the correct
> result if case sensitive is false (returns the same value as if it were
> true: 3,7). Every other version correctly performs the case folding
> (3,4,5,6,7).