How to find offsets in Unicode Text fast

Geoff Canyon gcanyon at gmail.com
Tue Nov 13 00:17:18 EST 2018


A few things:

1. It seems codepointOffset can only find a single character, so it
won't work when searching for a multi-character string?
2. codepointOffset seems to behave differently for multi-byte characters
and single-byte characters:

put codepointoffset("e","↘ndatestest",6) -- puts 3
put codepointoffset("e","andatestest",6) -- puts 9

3. It seems that when multi-byte characters are involved, codepointOffset
suffers from the same sort of slow-down as offset does. For example, in a
145K string with about 20K hits for a single character, a simple
codepointOffset routine (below) takes over 10 seconds, while the item-based
routine takes about 0.3 seconds for the same results.
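
For readers who have not seen the item-based idea, here is a minimal sketch
of how such a routine might look. It is not necessarily the routine timed
above: the handler name allOffsets and its parameter names are hypothetical,
and it assumes an engine that accepts a multi-character itemDelimiter
(LiveCode 7 or later). It returns character offsets of non-overlapping
matches.

```livecode
-- Hypothetical helper (illustrative only): returns a comma-separated list of
-- the 1-based character offsets of pNeedle in pHaystack.
function allOffsets pNeedle, pHaystack
   local tNeedleLength, tPosition, tOffsets
   if pNeedle is empty or pHaystack is empty then return empty
   put length(pNeedle) into tNeedleLength
   -- Use the needle as the item delimiter, so each item is the text
   -- between two consecutive hits.
   set the itemDelimiter to pNeedle
   put 1 - tNeedleLength into tPosition
   repeat for each item tChunk in pHaystack
      -- Skip this chunk plus one needle to land on the start of the next hit.
      add length(tChunk) + tNeedleLength to tPosition
      put tPosition & comma after tOffsets
   end repeat
   -- Tidy up: remove the trailing comma.
   set the itemDelimiter to comma
   delete char -1 of tOffsets
   -- If the text does not end with the needle, the final entry points
   -- past the end of the text and is not a real hit, so drop it.
   if tPosition > length(pHaystack) then delete the last item of tOffsets
   return tOffsets
end allOffsets
```

It could be called as: put allOffsets("e","↘ndatestest"). The point of the
design is that length() is only ever applied to the short chunks between
hits plus one final pass over the whole text, instead of the engine
re-scanning from the start of the string on every call with chars to skip.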

On Mon, Nov 12, 2018 at 4:21 PM Monte Goulding via use-livecode <
use-livecode at lists.runrev.com> wrote:

> Hi Folks
>
> I was a bit perplexed by this so I had a quick look about the engine and I
> see the issue. The problem is you are using `offset` which works on
> characters. Characters in LiveCode are neither Unicode codepoints nor bytes.
> They are graphemes. This means that when you have chars to skip, the entire
> string needs to be parsed to find the grapheme boundaries so that the index
> can be translated into graphemes to skip. Note that if the strings you were
> dealing with weren’t Unicode then the translation of chars to graphemes is
> 1 -> 1, so there’s no big cost, which is why things are much faster when you
> textEncode and offset that.
>
> So! Change to using codepointOffset and hopefully it will be much speedier!
>
> Cheers
>
> Monte