How to find offsets in Unicode Text fast

Brian Milby brian at milby7.com
Sun Nov 11 11:34:20 EST 2018


I just posted an updated stack with the UTF16 and UTF32 offset variants.

I did change the search on the first card to “The” and the counts remained the same so case folding does work for ASCII values.  I would need some other test text to check other cases where Unicode case folding would be expected to work.  I’m not surprised that it works for ASCII values since UTF16/32 will just pad with null bytes which will not impact the case folding logic.

This afternoon I think I’ll add a way to turn on/off case sensitive per card.  I also want to script running tests on a single card and all cards.

Thanks,
Brian
On Nov 11, 2018, 1:51 AM -0600, Geoff Canyon via use-livecode <use-livecode at lists.runrev.com>, wrote:
> One thing I don't get is how (not) caseSensitive gets handled? Once the
> text is all binary data, is the engine really still able to look at the
> binary values for "A" and "a" and treat them as the same?
>
> On Sat, Nov 10, 2018 at 8:54 PM Brian Milby via use-livecode <
> use-livecode at lists.runrev.com> wrote:
>
> > The correct formula for UTF16 should be:
> > put tPos div 2 + 1,"" after tResult
> >
> > The correct formula for UTF32 should be:
> > put tPos div 4 + 1,"" after tResult
> >
> > If you go to card #6 of my stack that is on GitHub, it has the first
> > chapter of John that I copied from the internet. I added a single UTF(8?)
> > character to it so get the results that are listed (last char of the first
> > visible line).
> >
> > Making the change is dramatic. Offset takes 65k ms. Geoff's code and my
> > code take 6k ms. Converting to UTF16 first and using offset takes 519 ms.
> > And note that the timings include those 2 calls to convert the variables to
> > UTF16.
> >
> > Now, if I take Geoff's sample that didn't work, I get the same incorrect
> > results. The problem is that some of those characters end up needing 2
> > UTF16 codepoints to represent them. (I hope I'm using the correct
> > terminology.)
> >
> > So, the safe solution is to always use UTF32. Speed looks to be close
> > between them. On my largest text example, UTF16 took 448ms and UTF32 took
> > 584ms.
> >
> > I'll update my stack and push an update to my repo so others can check out
> > the new code.
> >
> > On Sat, Nov 10, 2018 at 6:05 PM Niggemann, Bernd <
> > Bernd.Niggemann at uni-wh.de>
> > wrote:
> >
> > > That is what I alluded to,
> > > UTF is a wild country and I don't know my ways,
> > > try
> > > -----------------------------
> > > function allOffsets pDelim, pString, pCaseSensitive
> > > local tNewPos, tPos, tResult
> > >
> > > put textEncode(pDelim,"UTF32") into pDelim
> > > put textEncode(pString,"UTF32") into pString
> > >
> > > set the caseSensitive to pCaseSensitive is true
> > > put 0 into tPos
> > > repeat forever
> > > put offset(pDelim, pString, tPos) into tNewPos
> > > if tNewPos = 0 then exit repeat
> > > add tNewPos to tPos
> > > put tPos div 4 + tPos mod 4,"" after tResult
> > > end repeat
> > > if tResult is empty then return 0
> > > else return char 1 to -2 of tResult
> > > end allOffsets
> > > ----------------------------------
> > >
> > > It teaches me to use UTF32 to be on the safe side, thank you.
> > > But that should take care of it.
> > >
> > > Kind regards
> > > Bernd
> > >
> > >
> > > Unfortunately, I just discovered that your solution doesn't produce
> > correct
> > > results. If I get the offsets of "aaaaaaaaaa" in
> > > "↘𠜎aa↘𠜎aaaaaaaaaaaaa↘𠜎aaaa",
> > >
> > > My code (and Brian Milby's) will return: 7,8,9,10
> > >
> > > Your code will return: 9,10,11,12
> > >
> > > As I understand it, textEncode transforms unicode text into binary data,
> > > which has the effect of speeding things up because LC is no longer
> > dealing
> > > with variable-byte-length characters, just the underlying (fixed-length)
> > > binary data that makes them up. Hence the above discrepancy. At least I
> > > think so. Maybe there's a way to fix it?
> > >
> > > gc
> > >
> > >
> > >
> > > Am 11.11.2018 um 00:00 schrieb Geoff Canyon <gcanyon at gmail.com>:
> > >
> > > Unfortunately, I just discovered that your solution doesn't produce
> > > correct results. If I get the offsets of "aaaaaaaaaa" in
> > > "↘𠜎aa↘𠜎aaaaaaaaaaaaa↘𠜎aaaa",
> > >
> > > My code (and Brian Milby's) will return: 7,8,9,10
> > >
> > > Your code will return: 9,10,11,12
> > >
> > > As I understand it, textEncode transforms unicode text into binary data,
> > > which has the effect of speeding things up because LC is no longer
> > dealing
> > > with variable-byte-length characters, just the underlying (fixed-length)
> > > binary data that makes them up. Hence the above discrepancy. At least I
> > > think so. Maybe there's a way to fix it?
> > >
> > > gc
> > >
> > > On Sat, Nov 10, 2018 at 12:12 PM Niggemann, Bernd <
> > > Bernd.Niggemann at uni-wh.de> wrote:
> > >
> > > > I figured that the slowdown was due to UTF8, for each char it has to
> > test
> > > > if it is a compounded character. So I just tried with utf16 figuring,
> > that
> > > > now it just compares at the byte-level.
> > > >
> > > > As it turned out it was indeed faster.
> > > >
> > > > Now I don't understand unicode but as I understand for some
> > > > languages/signs/characters you need UTF32 to display them correctly. I
> > may
> > > > be wrong on that. But if it is true then the overhead to use UTF32 in
> > > > textEncoding only adds a small amount to processing time.
> > > >
> > > > The nice thing is that UTF16 and UTF32 textencoding also support
> > > > caseSensitivity. ByteOffset() for UTF16 is probably always
> > case-sensitive,
> > > > but only saves a small amount of processing time.
> > > >
> > > > Also, LC apparently has to turn ASCII into UTF8 as soon as there is one
> > > > non-ASCII character in the source text. In my naive understanding LC
> > could
> > > > internally switch to UTF16/32 for offset() as soon as it realizes that
> > UTF8
> > > > is in the source. Would make obsolete this workaround.
> > > >
> > > >
> > > > This is just how I "think" it works, the explanation may be all wrong.
> > > >
> > > > Kind regards
> > > >
> > > > Bernd
> > > >
> > > > Am 10.11.2018 um 20:30 schrieb Geoff Canyon <gcanyon at gmail.com>:
> > > >
> > > > This is faster -- under some circumstances, much faster! Any idea why
> > > > textEncoding suddenly fixes everything?
> > > >
> > > > On Sat, Nov 10, 2018 at 5:13 AM Niggemann, Bernd via use-livecode <
> > > > use-livecode at lists.runrev.com> wrote:
> > > >
> > > > > This is a little late but there was a discussion about the slowness of
> > > > > simple offset() when dealing with text that contains Unicode
> > characters.
> > > > >
> > > > > Geoff Canyon and Brian Milby found a faster solution by setting the
> > > > > itemDelimiter to the search string.
> > > > > They even provided a way to find the position of substrings in the
> > > > > search string which the offset() command does by design.
> > > > >
> > > > > Here I propose a variant of the offset() form that uses UTF16 to
> > search,
> > > > > easily adaptable to UTF32 if necessary.
> > > > >
> > > > > To test (as in Brian's testStack) add a unicode character to the text
> > to
> > > > > be searched e.g. at the end. Just any non-ASCII character to see the
> > speed
> > > > > penalty of simple offset(). I used ð (Icelandic d) or use any chinese
> > > > > character.
> > > > >
> > > > >
> > > > > Kind regards
> > > > > Bernd
> > > > >
> > > > > -------------------------------------------
> > > > > function allOffsets pDelim, pString, pCaseSensitive
> > > > > local tNewPos, tPos, tResult
> > > > >
> > > > > put textEncode(pDelim,"UTF16") into pDelim
> > > > > put textEncode(pString,"UTF16") into pString
> > > > >
> > > > > set the caseSensitive to pCaseSensitive is true
> > > > > put 0 into tPos
> > > > > repeat forever
> > > > > put offset(pDelim, pString, tPos) into tNewPos
> > > > > if tNewPos = 0 then exit repeat
> > > > > add tNewPos to tPos
> > > > > put tPos div 2 + tPos mod 2,"" after tResult
> > > > > end repeat
> > > > > if tResult is empty then return 0
> > > > > else return char 1 to -2 of tResult
> > > > > end allOffsets
> > > > > ----------------------------------------
> > > > >
> > > >
> > >
> > _______________________________________________
> > use-livecode mailing list
> > use-livecode at lists.runrev.com
> > Please visit this url to subscribe, unsubscribe and manage your
> > subscription preferences:
> > http://lists.runrev.com/mailman/listinfo/use-livecode
> _______________________________________________
> use-livecode mailing list
> use-livecode at lists.runrev.com
> Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
> http://lists.runrev.com/mailman/listinfo/use-livecode



More information about the use-livecode mailing list