How to find offsets in Unicode Text fast

Geoff Canyon gcanyon at gmail.com
Tue Nov 13 16:39:44 EST 2018


I didn't realize this conversation was just between Bernd and me, so here
it is for the list. Bernd found a solution for the Reykjavík issue
(seemingly: it works, but it's weird), and based on a conversation in
another thread I have a solution for case-insensitive matching. The
UTF-32 version has been updated to account for both issues. It's available
here: https://github.com/gcanyon/alloffsets
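
For anyone who wants the idea without opening the repository, here is a
minimal sketch in Python. It is not the LiveCode handler itself and not
necessarily how the repository does it; the name all_offsets and its
parameters are made up for illustration. It upper-cases both strings instead
of relying on a case-insensitive search, encodes them as UTF-32 so every code
point is exactly 4 bytes, searches the bytes, and keeps only matches that
start on a 4-byte boundary.

```python
def all_offsets(find, text, case_sensitive=False, no_overlaps=False):
    """Return the 1-based character offsets of `find` in `text`.

    Illustrative sketch only: the search runs over UTF-32 bytes.
    """
    if not find:
        return []
    if not case_sensitive:
        # Fold case up front instead of asking for a case-insensitive search.
        # Assumption: upper-casing does not change the length of the text
        # (it can, e.g. for "ß"), otherwise the offsets would shift.
        find, text = find.upper(), text.upper()

    # "utf-32-le" gives exactly 4 bytes per code point and no byte-order mark.
    needle = find.encode("utf-32-le")
    haystack = text.encode("utf-32-le")

    # After a real match, either step past it or allow overlapping matches.
    step = len(needle) if no_overlaps else 1

    offsets = []
    pos = haystack.find(needle)
    while pos != -1:
        if pos % 4 == 0:
            # Match starts on a character boundary: convert bytes to chars.
            offsets.append(pos // 4 + 1)
            pos = haystack.find(needle, pos + step)
        else:
            # Byte pattern straddles two characters: not a real match.
            pos = haystack.find(needle, pos + 1)
    return offsets
```

For example, all_offsets("aa", "aaaa") gives 1, 2, 3 with overlaps allowed
and 1, 3 with no_overlaps=True.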

On Tue, Nov 13, 2018 at 12:23 PM Geoff Canyon <gcanyon at gmail.com> wrote:

> It's amazing to me that appending the character "せ" for the textEncoding
> fixes the issues with the other characters. I have no idea why that would
> affect anything else at all. Maybe the engine crew can weigh in.
>
> In any case, you seem to have hit on the right bizarre solution, so I
> added that in. I also added a modification to correctly handle case by
> using toUpper instead of (wrongly) depending on caseSensitive, and changed
> from offset to byteOffset, which might speed things up a little. The UTF32
> version is about 3x faster than the item-based solution, but both scale
> well, so I added comments leaving it up to the developer which to use.
>
> The updated version is here: https://github.com/gcanyon/alloffsets
>
> On Tue, Nov 13, 2018 at 9:34 AM Niggemann, Bernd <Bernd.Niggemann at uni-wh.de> wrote:
>
>> Geoff,
>>
>> The thread is very instructive but also a bit disillusioning as far as
>> speed goes. I tried a couple of things Mark Waddingham recommended and they
>> kind of work (I don't know if I did it correctly) but are still slow. Not
>> as slow as simple offset for complex texts but still.
>>
>> Here I pick up on your latest attempt to use UTF-32 which fails on
>> Icelandic Reykjavík (the í is the culprit). There are more Icelandic
>> characters that fail UTF32.
>>
>> On the other hand UTF-32 works surprisingly fast and in many cases
>> accurately.
>>
>> So I figured I could force the text to be UTF-32 compliant by cheating:
>> append a Japanese character to pFind and pString before converting to
>> UTF-32, then remove it again afterwards.
>>
>> It turns out that it cures the Icelandic disease...
>> It should also cure similar cases in similar languages.
>> It turns out to be accurate in many things I tested in Brian's test stack.
>>
>> I would love to know the limits of this approach.
>>
>> Kind regards
>>
>> Bernd
>>
>> here is the code, additions marked as "new"
>>
>>
>> -------------------------------------------------
>> function allOffsets pFind,pString,pCaseSensitive,pNoOverlaps
>>    -- returns a comma-delimited list of the offsets of pFind in pString
>>    -- note, this seems to fail on some searches, for example:
>>    -- searching for "Reykjavík" in "Reykjavík er höfuðborg"
>>    -- It is uncertain why.
>>    -- See thread here:
>>    -- http://lists.runrev.com/pipermail/use-livecode/2018-November/251357.html
>>    local tNewPos, tPos, tResult, tSkip
>>
>>    put "せ" after pFind #<- new force UTF-32
>>    put "せ" after pString #<- new force UTF-32
>>
>>    put textEncode(pFind,"UTF-32") into pFind
>>    put textEncode(pString,"UTF-32") into pString
>>
>>    delete byte -4 to -1 of pFind #<- new force UTF-32
>>    delete byte -4 to -1 of pString #<- new force UTF-32
>>
>>    if pNoOverlaps then put length(pFind) - 1 into tSkip
>>
>>    set the caseSensitive to pCaseSensitive is true
>>    put 0 into tPos
>>    repeat forever
>>       put offset(pFind, pString, tPos) into tNewPos
>>       if tNewPos = 0 then exit repeat
>>       add tNewPos to tPos
>>       if tPos mod 4 = 1 then put (tPos div 4 + 1),"" after tResult
>>       if pNoOverlaps then add tSkip to tPos
>>    end repeat
>>    if tResult is empty then return 0
>>    else return char 1 to -2 of tResult
>> end allOffsets
>> -------------------------------------------------
>>
>


