How to find offsets in Unicode Text fast

Geoff Canyon gcanyon at gmail.com
Tue Nov 13 00:35:54 EST 2018


I didn't realize until now that offset() simply fails with some Unicode
strings:

put offset("a","↘𠜎qeiuruioqeaaa↘𠜎qeiuar",13) -- puts 0
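
For anyone trying to reproduce this, here is a rough sketch (not from the
thread) that counts the first two symbols of that string in each of the units
Monte describes in the quoted message below. It assumes LiveCode 7 or later,
where char, codepoint and codeunit are separate chunk types; the handler name
showUnitCounts and the variables tText and tReport are made up for
illustration. ↘ (U+2198) sits inside the Basic Multilingual Plane, while 𠜎
(U+2070E) sits outside it and so needs a surrogate pair in UTF-16, which is
why the counts should not all agree.

```livecode
-- Illustrative sketch only: showUnitCounts, tText and tReport are made-up names.
-- Assumes LiveCode 7 or later (char, codepoint and codeunit chunk types).
on showUnitCounts
   local tText, tReport
   put "↘𠜎" into tText

   -- "char" counts graphemes, the unit that offset() skips by
   put "chars:" && (the number of chars in tText) & return after tReport

   -- "codepoint" counts Unicode codepoints, the unit codepointOffset() uses
   put "codepoints:" && (the number of codepoints in tText) & return after tReport

   -- "codeunit" counts the units of the engine's internal string encoding
   put "codeunits:" && (the number of codeunits in tText) & return after tReport

   -- bytes only make sense once the text is encoded as binary data
   put "UTF-8 bytes:" && (the number of bytes in textEncode(tText, "UTF-8")) after tReport

   -- show the report in the message box
   put tReport
end showUnitCounts
```

If the counts differ, a skip value computed in one unit will not line up with
a function that skips in another.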

On Mon, Nov 12, 2018 at 9:17 PM Geoff Canyon <gcanyon at gmail.com> wrote:

> A few things:
>
> 1. It seems codepointOffset can only find a single character? So it
> won't work for any search for a multi-character string?
> 2. codepointOffset seems to work differently for multi-byte characters and
> regular characters:
>
> put codepointoffset("e","↘ndatestest",6) -- puts 3
> put codepointoffset("e","andatestest",6) -- puts 9
>
> 3. It seems that when multi-byte characters are involved, codepointOffset
> suffers from the same sort of slow-down as offset does. For example, in a
> 145K string with about 20K hits for a single character, a simple
> codepointOffset routine (below) takes over 10 seconds, while the item-based
> routine takes about 0.3 seconds for the same results.
>
> On Mon, Nov 12, 2018 at 4:21 PM Monte Goulding via use-livecode <
> use-livecode at lists.runrev.com> wrote:
>
>> Hi Folks
>>
>> I was a bit perplexed by this so I had a quick look around the engine and
>> I see the issue. The problem is that you are using `offset`, which works on
>> characters. Characters in LiveCode are neither Unicode codepoints nor bytes.
>> They are graphemes. This means that when you have chars to skip, the entire
>> string needs to be parsed to find the grapheme boundaries so that the index
>> can be translated into graphemes to skip. Note that if the strings you were
>> dealing with weren’t Unicode, then the translation of chars to graphemes is
>> 1 -> 1, so there’s no big cost. That is why things are much faster when you
>> textEncode and offset that.
>>
>> So! Change to using codepointOffset and hopefully it will be much
>> speedier!
>>
>> Cheers
>>
>> Monte
