How to find offsets in Unicode Text fast

Geoff Canyon gcanyon at gmail.com
Sun Nov 11 02:50:14 EST 2018


One thing I don't get is how (not) caseSensitive gets handled? Once the
text is all binary data, is the engine really still able to look at the
binary values for "A" and "a" and treat them as the same?

On Sat, Nov 10, 2018 at 8:54 PM Brian Milby via use-livecode <
use-livecode at lists.runrev.com> wrote:

> The correct formula for UTF16 should be:
> put tPos div 2 + 1,"" after tResult
>
> The correct formula for UTF32 should be:
> put tPos div 4 + 1,"" after tResult
>
> If you go to card #6 of my stack that is on GitHub, it has the first
> chapter of John that I copied from the internet.  I added a single UTF(8?)
> character to it so get the results that are listed (last char of the first
> visible line).
>
> Making the change is dramatic.  Offset takes 65k ms.  Geoff's code and my
> code take 6k ms.  Converting to UTF16 first and using offset takes 519 ms.
> And note that the timings include those 2 calls to convert the variables to
> UTF16.
>
> Now, if I take Geoff's sample that didn't work, I get the same incorrect
> results.  The problem is that some of those characters end up needing 2
> UTF16 codepoints to represent them. (I hope I'm using the correct
> terminology.)
>
> So, the safe solution is to always use UTF32.  Speed looks to be close
> between them.  On my largest text example, UTF16 took 448ms and UTF32 took
> 584ms.
>
> I'll update my stack and push an update to my repo so others can check out
> the new code.
>
> On Sat, Nov 10, 2018 at 6:05 PM Niggemann, Bernd <
> Bernd.Niggemann at uni-wh.de>
> wrote:
>
> > That is what I alluded to,
> > UTF is  a wild country and I don't know my ways,
> > try
> > -----------------------------
> > function allOffsets pDelim, pString, pCaseSensitive
> >    local tNewPos, tPos, tResult
> >
> >    put textEncode(pDelim,"UTF32") into pDelim
> >    put textEncode(pString,"UTF32") into pString
> >
> >    set the caseSensitive to pCaseSensitive is true
> >    put 0 into tPos
> >    repeat forever
> >       put offset(pDelim, pString, tPos) into tNewPos
> >       if tNewPos = 0 then exit repeat
> >       add tNewPos to tPos
> >       put tPos div 4 + tPos mod 4,"" after tResult
> >    end repeat
> >    if tResult is empty then return 0
> >    else return char 1 to -2 of tResult
> > end allOffsets
> > ----------------------------------
> >
> > It teaches me to use UTF32 to be on the safe side, thank you.
> > But that should take care of it.
> >
> > Kind regards
> > Bernd
> >
> >
> > Unfortunately, I just discovered that your solution doesn't produce
> correct
> > results. If I get the offsets of "aaaaaaaaaa" in
> > "↘𠜎aa↘𠜎aaaaaaaaaaaaa↘𠜎aaaa",
> >
> > My code (and Brian Milby's) will return: 7,8,9,10
> >
> > Your code will return: 9,10,11,12
> >
> > As I understand it, textEncode transforms unicode text into binary data,
> > which has the effect of speeding things up because LC is no longer
> dealing
> > with variable-byte-length characters, just the underlying (fixed-length)
> > binary data that makes them up. Hence the above discrepancy. At least I
> > think so. Maybe there's a way to fix it?
> >
> > gc
> >
> >
> >
> > Am 11.11.2018 um 00:00 schrieb Geoff Canyon <gcanyon at gmail.com>:
> >
> > Unfortunately, I just discovered that your solution doesn't produce
> > correct results. If I get the offsets of "aaaaaaaaaa" in
> > "↘𠜎aa↘𠜎aaaaaaaaaaaaa↘𠜎aaaa",
> >
> > My code (and Brian Milby's) will return: 7,8,9,10
> >
> > Your code will return: 9,10,11,12
> >
> > As I understand it, textEncode transforms unicode text into binary data,
> > which has the effect of speeding things up because LC is no longer
> dealing
> > with variable-byte-length characters, just the underlying (fixed-length)
> > binary data that makes them up. Hence the above discrepancy. At least I
> > think so. Maybe there's a way to fix it?
> >
> > gc
> >
> > On Sat, Nov 10, 2018 at 12:12 PM Niggemann, Bernd <
> > Bernd.Niggemann at uni-wh.de> wrote:
> >
> >> I figured that the slowdown was due to UTF8, for each char it has to
> test
> >> if it is a compounded character. So I just tried with utf16 figuring,
> that
> >> now it just compares at the byte-level.
> >>
> >> As it turned out it was indeed faster.
> >>
> >> Now I don't understand unicode but as I understand for some
> >> languages/signs/characters you need UTF32 to display them correctly. I
> may
> >> be wrong on that. But if it is true then the overhead to use UTF32 in
> >> textEncoding only adds a small amount to processing time.
> >>
> >> The nice thing is that UTF16 and UTF32 textencoding also support
> >> caseSensitivity.  ByteOffset() for UTF16 is probably always
> case-sensitive,
> >> but only saves a small amount of processing time.
> >>
> >> Also, LC apparently has to turn ASCII into UTF8 as soon as there is one
> >> non-ASCII character in the source text. In my naive understanding LC
> could
> >> internally switch to UTF16/32 for offset() as soon as it realizes that
> UTF8
> >> is in the source. Would make obsolete this workaround.
> >>
> >>
> >> This is just how I "think" it works, the explanation may be all wrong.
> >>
> >> Kind regards
> >>
> >> Bernd
> >>
> >> Am 10.11.2018 um 20:30 schrieb Geoff Canyon <gcanyon at gmail.com>:
> >>
> >> This is faster -- under some circumstances, much faster! Any idea why
> >> textEncoding suddenly fixes everything?
> >>
> >> On Sat, Nov 10, 2018 at 5:13 AM Niggemann, Bernd via use-livecode <
> >> use-livecode at lists.runrev.com> wrote:
> >>
> >>> This is a little late but there was a discussion about the slowness of
> >>> simple offset() when dealing with text that contains Unicode
> characters.
> >>>
> >>> Geoff Canyon and Brian Milby found a faster solution by setting the
> >>> itemDelimiter to the search string.
> >>> They even provided a way to find the position of substrings in the
> >>> search string which the offset() command does by design.
> >>>
> >>> Here I propose a variant of the offset() form that uses UTF16 to
> search,
> >>> easily adaptable to UTF32 if necessary.
> >>>
> >>> To test (as in Brian's testStack) add a unicode character to the text
> to
> >>> be searched e.g. at the end. Just any non-ASCII character to see the
> speed
> >>> penalty of simple offset(). I used ð (Icelandic d) or use any chinese
> >>> character.
> >>>
> >>>
> >>> Kind regards
> >>> Bernd
> >>>
> >>> -------------------------------------------
> >>> function allOffsets pDelim, pString, pCaseSensitive
> >>>    local tNewPos, tPos, tResult
> >>>
> >>>    put textEncode(pDelim,"UTF16") into pDelim
> >>>    put textEncode(pString,"UTF16") into pString
> >>>
> >>>    set the caseSensitive to pCaseSensitive is true
> >>>    put 0 into tPos
> >>>    repeat forever
> >>>       put offset(pDelim, pString, tPos) into tNewPos
> >>>       if tNewPos = 0 then exit repeat
> >>>       add tNewPos to tPos
> >>>       put tPos div 2 + tPos mod 2,"" after tResult
> >>>    end repeat
> >>>    if tResult is empty then return 0
> >>>    else return char 1 to -2 of tResult
> >>> end allOffsets
> >>> ----------------------------------------
> >>>
> >>
> >
> _______________________________________________
> use-livecode mailing list
> use-livecode at lists.runrev.com
> Please visit this url to subscribe, unsubscribe and manage your
> subscription preferences:
> http://lists.runrev.com/mailman/listinfo/use-livecode



More information about the use-livecode mailing list