How to find offsets in Unicode Text fast
gcanyon at gmail.com
Sat Nov 10 18:00:32 EST 2018
Unfortunately, I just discovered that your solution doesn't produce correct
results. If I get the offsets of "aaaaaaaaaa" in
My code (and Brian Milby's) will return: 7,8,9,10
Your code will return: 9,10,11,12
As I understand it, textEncode transforms unicode text into binary data,
which has the effect of speeding things up because LC is no longer dealing
with variable-byte-length characters, just the underlying (fixed-length)
binary data that makes them up. Hence the above discrepancy. At least I
think so. Maybe there's a way to fix it?
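
One possible cause of a shift like this, sketched in Python rather than LiveCode: positions counted in UTF-16 code units run ahead of character positions whenever an earlier character needs two code units (a surrogate pair). The sample text and both function names below are hypothetical, and Python strings index code points, which only approximates LiveCode's char chunk. The second function suggests one way to map code-unit offsets back to character offsets.

```python
def utf16_unit_offsets(needle, haystack):
    """Return 1-based offsets of needle, counted in UTF-16 code units."""
    n = needle.encode("utf-16-le")
    h = haystack.encode("utf-16-le")
    offsets = []
    start = 0
    while True:
        i = h.find(n, start)
        if i == -1:
            break
        if i % 2 == 0:  # keep only matches that start on a code-unit boundary
            offsets.append(i // 2 + 1)
        start = i + 1  # step one byte forward so overlapping matches are found
    return offsets


def unit_to_char_offsets(haystack, unit_offsets):
    """Map 1-based UTF-16 code-unit offsets to 1-based code-point offsets."""
    unit_to_char = {}
    unit = 1
    for char_index, ch in enumerate(haystack, start=1):
        unit_to_char[unit] = char_index
        # code points above U+FFFF occupy two UTF-16 code units
        unit += 2 if ord(ch) > 0xFFFF else 1
    return [unit_to_char[u] for u in unit_offsets if u in unit_to_char]


# Hypothetical sample: two non-BMP characters in front of the matches.
sample = "\U0001F600\U0001F600 aaaa"
units = utf16_unit_offsets("aa", sample)
chars = unit_to_char_offsets(sample, units)
```

In this sample each of the two leading characters takes two code units, so every code-unit offset is 2 higher than the matching character offset. Whether the same thing explains the discrepancy above depends on the text that was searched.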
On Sat, Nov 10, 2018 at 12:12 PM Niggemann, Bernd <Bernd.Niggemann at uni-wh.de>
wrote:
> I figured that the slowdown was due to UTF8: for each char, LC has to test
> whether it is a compound character. So I tried UTF16, figuring that it
> would then just compare at the byte level.
> As it turned out, it was indeed faster.
> Now, I don't understand Unicode well, but as I understand it, some
> languages/signs/characters need UTF32 to be displayed correctly. I may
> be wrong on that. But if it is true, then the overhead of using UTF32 in
> textEncode() only adds a small amount to processing time.
> The nice thing is that offset() on UTF16- and UTF32-encoded text also
> supports caseSensitive. byteOffset() for UTF16 is probably always
> case-sensitive, but it only saves a small amount of processing time.
> Also, LC apparently has to turn ASCII into UTF8 as soon as there is one
> non-ASCII character in the source text. In my naive understanding, LC could
> internally switch to UTF16/32 for offset() as soon as it detects UTF8 in
> the source. That would make this workaround obsolete.
> This is just how I "think" it works; the explanation may be all wrong.
> Kind regards
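
A small Python sketch (not LiveCode) of how many bytes a single code point takes in each encoding discussed above. The sample characters and the helper name encoded_widths are chosen only for illustration. It suggests that UTF-16 is fixed-width only for characters in the Basic Multilingual Plane, while UTF-32 always uses 4 bytes per code point.

```python
def encoded_widths(ch):
    """Return the number of bytes one code point occupies in each encoding."""
    # The "-le" variants are used so that no byte order mark is added.
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


# ASCII letter, Icelandic eth, a Chinese character, and a non-BMP emoji.
samples = ["a", "\u00f0", "\u4e2d", "\U0001F600"]
widths = {ch: encoded_widths(ch) for ch in samples}

for ch, w in widths.items():
    print("U+%04X" % ord(ch), w)
```

Even in UTF-32, one code point is not always one user-perceived character, because combining marks are separate code points. A fixed-width encoding therefore does not by itself guarantee that byte positions map cleanly onto chars.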
> On 10.11.2018 at 20:30, Geoff Canyon <gcanyon at gmail.com> wrote:
> This is faster -- under some circumstances, much faster! Any idea why
> textEncoding suddenly fixes everything?
> On Sat, Nov 10, 2018 at 5:13 AM Niggemann, Bernd via use-livecode <
> use-livecode at lists.runrev.com> wrote:
>> This is a little late but there was a discussion about the slowness of
>> simple offset() when dealing with text that contains Unicode characters.
>> Geoff Canyon and Brian Milby found a faster solution by setting the
>> itemDelimiter to the search string.
>> They even provided a way to find overlapping occurrences of the search
>> string, which the offset() function finds by design.
>> Here I propose a variant of the offset() form that uses UTF16 to search,
>> easily adaptable to UTF32 if necessary.
>> To test (as in Brian's testStack), add a Unicode character to the text to
>> be searched, e.g. at the end. Any non-ASCII character will show the speed
>> penalty of simple offset(). I used ð (Icelandic d) or use any Chinese
>> Kind regards
>> function allOffsets pDelim, pString, pCaseSensitive
>>    local tNewPos, tPos, tResult
>>    -- search the UTF16-encoded binary data instead of the text
>>    put textEncode(pDelim,"UTF16") into pDelim
>>    put textEncode(pString,"UTF16") into pString
>>    set the caseSensitive to pCaseSensitive is true
>>    put 0 into tPos
>>    repeat forever
>>       put offset(pDelim, pString, tPos) into tNewPos
>>       if tNewPos = 0 then exit repeat
>>       add tNewPos to tPos
>>       -- convert the byte position to a position in 2-byte units
>>       put tPos div 2 + tPos mod 2,"" after tResult
>>    end repeat
>>    if tResult is empty then return 0
>>    else return char 1 to -2 of tResult
>> end allOffsets
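
For readers without LiveCode at hand, here is a rough Python analogue of the loop above. It is a sketch, not a faithful port: the name all_offsets is hypothetical, case handling is left out, and it returns an empty list where the original returns 0. It also adds one guard the original does not have, which discards matches that start in the middle of a 2-byte unit.

```python
def all_offsets(delim, text):
    """Return 1-based offsets of delim in text, searching UTF-16 bytes."""
    d = delim.encode("utf-16-le")  # "-le" avoids a byte order mark
    t = text.encode("utf-16-le")
    result = []
    pos = 0  # bytes already skipped, playing the role of tPos
    while True:
        found = t.find(d, pos)
        if found == -1:
            break
        # 1-based byte position of the match; searching from here next time
        # skips its first byte, so overlapping matches are still found
        pos = found + 1
        if pos % 2 == 1:  # added guard: match starts on a 2-byte boundary
            # same arithmetic as "tPos div 2 + tPos mod 2"
            result.append(pos // 2 + pos % 2)
    return result


print(all_offsets("ana", "banana"))
```

The result is counted in UTF-16 code units, so it equals the character position only as long as every earlier character fits in a single code unit.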