How to find offsets in Unicode Text fast

Niggemann, Bernd Bernd.Niggemann at uni-wh.de
Mon Nov 12 15:08:42 EST 2018


Ben,

Please see my remarks about UTF-32 failing with some Icelandic characters. For now I would not recommend offset(UTF-32 text) unless one knows which character set is suitable and is in control of that character set. The same goes for UTF-16.

I also thought that byteOffset would be faster for case-sensitive search in UTF-32 text. It turned out to be slower than offset(UTF-32 text).
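
For readers who want to see the byte-level idea spelled out, here is a minimal sketch in Python rather than LiveCode. It is only an illustration under stated assumptions: the function name utf32_char_offset is hypothetical, the text is encoded as big-endian UTF-32 without a byte order mark, and the comparison is by code point, so it is case-sensitive and does no normalization. It says nothing about how LiveCode's offset or byteOffset work internally or how fast they are.

```python
def utf32_char_offset(needle: str, haystack: str) -> int:
    """Return the 1-based code point offset of needle in haystack, or 0 if absent.

    Hypothetical helper for illustration only; not a LiveCode function.
    """
    if not needle:
        return 0
    # Big-endian UTF-32 without a BOM: every code point is exactly 4 bytes.
    hay = haystack.encode("utf-32-be")
    pat = needle.encode("utf-32-be")
    start = 0
    while True:
        pos = hay.find(pat, start)
        if pos < 0:
            return 0
        # A raw byte search can match across a code point boundary,
        # so only accept matches that start on a 4-byte boundary.
        if pos % 4 == 0:
            return pos // 4 + 1
        start = pos + 1


if __name__ == "__main__":
    print(utf32_char_offset("ð", "Það er gott"))
```

The alignment check is the part that is easy to forget: without it, a plain byte search can report a hit that straddles two characters.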

>Ben Rubinstein via use-livecode, Mon, 12 Nov 2018 11:38:26 -0800

>Coming late to this discussion. Very excited by this approach of converting everything to UTF-32 in order to do fast offsets.

>In the meantime I'd be suspicious about doing a case-insensitive search in this way; but my guess would be that, if your use-case will accept case-sensitivity, it would be safer (and faster?) to use byteOffset on the UTF-32 data rather than offset.

Kind regards
Bernd


