How to find offsets in Unicode Text fast
Geoff Canyon
gcanyon at gmail.com
Sat Nov 10 14:30:11 EST 2018
This is faster -- under some circumstances, much faster! Any idea why
textEncode suddenly fixes everything?
On Sat, Nov 10, 2018 at 5:13 AM Niggemann, Bernd via use-livecode <
use-livecode at lists.runrev.com> wrote:
> This is a little late but there was a discussion about the slowness of
> simple offset() when dealing with text that contains Unicode characters.
>
> Geoff Canyon and Brian Milby found a faster solution by setting the
> itemDelimiter to the search string.
> They even provided a way to find the positions of substrings within the
> search string, which the offset() function does by design.
>
> Here I propose a variant of the offset() form that uses UTF16 to search,
> easily adaptable to UTF32 if necessary.
>
> To test (as in Brian's testStack), add a Unicode character to the text to
> be searched, e.g. at the end. Any non-ASCII character will show the speed
> penalty of simple offset(). I used ð (Icelandic d), or use any Chinese
> character.
>
>
> Kind regards
> Bernd
>
> -------------------------------------------
> function allOffsets pDelim, pString, pCaseSensitive
>    local tNewPos, tPos, tResult
>
>    -- search the UTF16-encoded bytes instead of the Unicode text
>    put textEncode(pDelim,"UTF16") into pDelim
>    put textEncode(pString,"UTF16") into pString
>
>    set the caseSensitive to pCaseSensitive is true
>    put 0 into tPos
>    repeat forever
>       -- offset() skips tPos bytes and returns a position relative to the skip
>       put offset(pDelim, pString, tPos) into tNewPos
>       if tNewPos = 0 then exit repeat
>       add tNewPos to tPos
>       -- convert the 1-based byte position to a character position
>       put tPos div 2 + tPos mod 2,"" after tResult
>    end repeat
>    -- strip the trailing comma from the list of positions
>    if tResult is empty then return 0
>    else return char 1 to -2 of tResult
> end allOffsets
> -----------------------------------------
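
For reference, here is a rough sketch of the UTF32 adaptation mentioned above. It is not from Bernd's message: the handler name allOffsets32 is made up, and it assumes that textEncode accepts "UTF-32" and emits no byte order mark, the same way the UTF16 version behaves. In UTF-32 every character takes exactly 4 bytes, so character k starts at byte 4k - 3. The sketch also skips matches that do not start on a 4-byte boundary, which is an extra check the UTF16 version does not make.

```livecode
-- Hypothetical UTF-32 variant of allOffsets; a sketch, not tested against large data
function allOffsets32 pDelim, pString, pCaseSensitive
   local tNewPos, tPos, tResult

   -- assumption: no byte order mark is added, so character k starts at byte 4k - 3
   put textEncode(pDelim, "UTF-32") into pDelim
   put textEncode(pString, "UTF-32") into pString

   set the caseSensitive to pCaseSensitive is true
   put 0 into tPos
   repeat forever
      put offset(pDelim, pString, tPos) into tNewPos
      if tNewPos = 0 then exit repeat
      add tNewPos to tPos
      -- keep only matches that begin on a character boundary
      if (tPos - 1) mod 4 is 0 then
         put (tPos + 3) div 4, "" after tResult
      end if
   end repeat
   if tResult is empty then return 0
   return char 1 to -2 of tResult
end allOffsets32
```

To compare it with the UTF16 version, call both with the same arguments, for example allOffsets("bc", "abcabc" & "ð", false) and allOffsets32("bc", "abcabc" & "ð", false). Both should report the same character positions if the assumptions above hold.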