How to find offsets in Unicode Text fast
gcanyon at gmail.com
Tue Nov 13 16:39:44 EST 2018
I didn't realize this conversation was just between Bernd and me, so here
it is for the list. Bernd found a solution for the Reykjavík issue
(seemingly -- it works, but it's weird), and based on a conversation in
another thread I have a solution for case-insensitive matching. The
UTF-32 version has been updated to account for both issues. It's available
at https://github.com/gcanyon/alloffsets
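The idea discussed in this thread -- encode both strings as fixed-width UTF-32, search the bytes, and uppercase both strings first for case-insensitive matching -- can be sketched in Python. This is only an illustration of the approach, not the LiveCode handler itself; `all_offsets` and its parameter names are mine, and Python's `str.upper()` only roughly corresponds to LiveCode's `toUpper`:

```python
def all_offsets(find, string, case_sensitive=False, no_overlaps=False):
    """Sketch of the UTF-32 byte-search approach; 1-based char offsets."""
    if not case_sensitive:
        # Uppercase both strings instead of relying on caseSensitive,
        # mirroring the toUpper change described above.
        find, string = find.upper(), string.upper()
    fb = find.encode("utf-32-le")    # fixed width: 4 bytes per character
    sb = string.encode("utf-32-le")
    offsets = []
    pos = sb.find(fb)
    while pos != -1:
        if pos % 4 == 0:                  # keep only character-aligned hits
            offsets.append(pos // 4 + 1)  # byte offset -> 1-based char index
            if no_overlaps:
                pos += len(fb) - 1        # skip past the match, like tSkip
        pos = sb.find(fb, pos + 1)
    return offsets
```

For example, `all_offsets("í", "Reykjavík")` finds the accented character at character position 8, and `no_overlaps=True` suppresses overlapping matches the same way the `tSkip` logic does in the LiveCode version below.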
On Tue, Nov 13, 2018 at 12:23 PM Geoff Canyon <gcanyon at gmail.com> wrote:
> It's amazing to me that appending the character "せ" before textEncoding
> fixes the issues with the other characters. I have no idea why that would
> affect anything else at all. Maybe the engine crew can weigh in.
> In any case, you seem to have hit on the right bizarre solution, so I
> added that in. I also added a modification to correctly handle case by
> using toUpper instead of (wrongly) depending on caseSensitive, and changed
> from offset to byteOffset, which might speed things up a little. The UTF-32
> version is about 3x faster than the item-based solution, but both scale
> well, so I added comments leaving it up to the developer which to use.
> The updated version is here: https://github.com/gcanyon/alloffsets
> On Tue, Nov 13, 2018 at 9:34 AM Niggemann, Bernd <
> Bernd.Niggemann at uni-wh.de> wrote:
>> The thread is very instructive but also a bit disillusioning as far as
>> speed goes. I tried a couple of things Mark Waddingham recommended, and
>> they kind of work (I don't know if I did it correctly) but are still
>> slow -- not as slow as a simple offset on complex texts, but still slow.
>> Here I pick up on your latest attempt to use UTF-32, which fails on the
>> Icelandic "Reykjavík" (the "í" is the culprit). There are more Icelandic
>> characters that fail with UTF-32.
>> On the other hand, UTF-32 is surprisingly fast and works in many cases.
>> So I figured I would cheat by forcing the text to be UTF-32 compliant:
>> append a Japanese character to pFind and pString before converting to
>> UTF-32, and remove it again afterwards.
>> It turns out that this cures the Icelandic disease, and it should also
>> cure similar cases in other languages. It is accurate in many of the
>> things I tested in Brian's test stack, but I would love to know the
>> limits of this approach.
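Mechanically, the append-and-strip workaround is tiny; here is a Python rendering of just that step. It cannot reproduce the LiveCode engine's behavior -- in Python the result is byte-identical to encoding directly, which is exactly why the fix is surprising. I use little-endian UTF-32 here; LiveCode's "UTF-32" encoding is presumably host byte order:

```python
def force_utf32(s):
    # Bernd's workaround: append a character ("せ", U+305B), encode to
    # fixed-width UTF-32, then strip the appended character's 4 bytes.
    padded = (s + "\u305b").encode("utf-32-le")
    return padded[:-4]

# In Python this is byte-identical to encoding directly; the LiveCode
# engine apparently treats the two paths differently.
assert force_utf32("Reykjavík") == "Reykjavík".encode("utf-32-le")
```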
>> Kind regards
>> here is the code, additions marked as "new"
>>
>> function allOffsets pFind, pString, pCaseSensitive, pNoOverlaps
>>    -- returns a comma-delimited list of the offsets of pFind in pString
>>    -- note, this seems to fail on some searches, for example:
>>    -- searching for "Reykjavík" in "Reykjavík er höfuðborg"
>>    -- It is uncertain why. See thread here:
>>    local tNewPos, tPos, tResult, tSkip
>>    put "せ" after pFind # <- new, force UTF-32
>>    put "せ" after pString # <- new, force UTF-32
>>    put textEncode(pFind, "UTF-32") into pFind
>>    put textEncode(pString, "UTF-32") into pString
>>    delete byte -4 to -1 of pFind # <- new, force UTF-32
>>    delete byte -4 to -1 of pString # <- new, force UTF-32
>>    if pNoOverlaps then put length(pFind) - 1 into tSkip
>>    set the caseSensitive to pCaseSensitive is true
>>    put 0 into tPos
>>    repeat forever
>>       put offset(pFind, pString, tPos) into tNewPos
>>       if tNewPos = 0 then exit repeat
>>       add tNewPos to tPos
>>       if tPos mod 4 = 1 then put (tPos div 4 + 1),"" after tResult
>>       if pNoOverlaps then add tSkip to tPos
>>    end repeat
>>    if tResult is empty then return 0
>>    else return char 1 to -2 of tResult
>> end allOffsets
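The alignment test in the loop above (`tPos mod 4 = 1`) is doing real work: in fixed-width UTF-32, a byte pattern can straddle two adjacent characters and match at an unaligned byte offset. A small Python demonstration, again assuming little-endian UTF-32; the characters are arbitrary, chosen only so that their bytes overlap:

```python
# A spurious, unaligned UTF-32 byte match, of the kind the
# "tPos mod 4 = 1" test in the handler above filters out.
needle = "a".encode("utf-32-le")               # bytes: 61 00 00 00
haystack = "\u6161\u0100".encode("utf-32-le")  # bytes: 61 61 00 00 00 01 00 00
pos = haystack.find(needle)
print(pos)           # 1 -- a byte match, although the string contains no "a"
print(pos % 4 == 0)  # False -- the alignment check correctly rejects it
```

The needle's bytes appear starting at byte 1 of the haystack, straddling the two characters, even though neither character is an "a".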