How to find offsets in Unicode Text fast

Niggemann, Bernd Bernd.Niggemann at
Sat Nov 10 08:12:44 EST 2018

This is a little late, but there was a discussion about how slow a plain offset() becomes when the text contains Unicode characters.

Geoff Canyon and Brian Milby found a faster solution by setting the itemDelimiter to the search string.
They even provided a way to find the positions of substrings within the search string, which the offset() function does by design.
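For readers who have not seen the itemDelimiter idea, here is a minimal sketch of how it can work. It is not the code Geoff and Brian posted: itemDelimOffsets is a hypothetical name, it assumes a LiveCode version that accepts multi-character item delimiters, and it reports non-overlapping matches only.

```livecode
-- Sketch only: itemDelimOffsets is a hypothetical name, not the handler from the thread.
-- Returns a comma-separated list of the character positions at which pDelim
-- starts in pString (non-overlapping matches), or 0 if there is none.
function itemDelimOffsets pDelim, pString, pCaseSensitive
   local tDelimLength, tStringLength, tStart, tCandidate, tItem, tResult
   if pDelim is empty then return 0
   set the caseSensitive to pCaseSensitive is true
   set the itemDelimiter to pDelim
   put length(pDelim) into tDelimLength
   put length(pString) into tStringLength
   put 1 into tStart
   repeat for each item tItem in pString
      -- the text right after this item is a delimiter, unless the string ends here
      put tStart + length(tItem) into tCandidate
      if tCandidate + tDelimLength - 1 > tStringLength then exit repeat
      put tCandidate & comma after tResult
      -- the next item starts right after the delimiter
      put tCandidate + tDelimLength into tStart
   end repeat
   if tResult is empty then return 0
   return char 1 to -2 of tResult
end itemDelimOffsets
```

The engine splits the text on the search string, so the handler only has to add up item lengths instead of calling offset() repeatedly.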

Here I propose a variant of the offset() form that uses UTF16 to search, easily adaptable to UTF32 if necessary.

To test (as in Brian's testStack), add a Unicode character to the text to be searched, e.g. at the end. Any non-ASCII character will do to show the speed penalty of a simple offset(). I used ð (Icelandic d); any Chinese character works as well.
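If you do not have the testStack at hand, a rough timing test could look like the sketch below. The handler names, the sample text and the repeat count are arbitrary choices of mine, not taken from the testStack, and the timings will depend on the engine version and the machine.

```livecode
-- Hypothetical timing helpers; names, sample text and sizes are arbitrary choices.
-- plainOffsets is the straightforward offset() loop on the text itself.
function plainOffsets pDelim, pString
   local tPos, tNewPos, tResult
   put 0 into tPos
   repeat forever
      put offset(pDelim, pString, tPos) into tNewPos
      if tNewPos = 0 then exit repeat
      add tNewPos to tPos
      put tPos & comma after tResult
   end repeat
   if tResult is empty then return 0
   return char 1 to -2 of tResult
end plainOffsets

on testOffsetSpeed
   local tText, tStart, tAsciiTime, tUnicodeTime
   repeat 5000
      put "the quick brown fox jumps over the lazy dog " after tText
   end repeat
   -- first run: pure ASCII text
   put the milliseconds into tStart
   get plainOffsets("fox", tText)
   put the milliseconds - tStart into tAsciiTime
   -- one non-ASCII character at the end is enough to make the text Unicode
   put numToCodepoint(240) after tText -- 240 is the code point of ð
   put the milliseconds into tStart
   get plainOffsets("fox", tText)
   put the milliseconds - tStart into tUnicodeTime
   -- show both timings in the message box
   put "ASCII:" && tAsciiTime && "ms, Unicode:" && tUnicodeTime && "ms"
end testOffsetSpeed
```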

Kind regards

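function allOffsets pDelim, pString, pCaseSensitive
   local tNewPos, tPos, tResult
   -- search the UTF-16 bytes instead of the Unicode text
   put textEncode(pDelim,"UTF16") into pDelim
   put textEncode(pString,"UTF16") into pString
   set the caseSensitive to pCaseSensitive is true
   put 0 into tPos
   repeat forever
      -- tPos is the number of bytes to skip; tNewPos is relative to it
      put offset(pDelim, pString, tPos) into tNewPos
      if tNewPos = 0 then exit repeat
      add tNewPos to tPos
      -- a match at an even byte position straddles two characters: skip it
      if tPos mod 2 = 0 then next repeat
      -- two bytes per UTF-16 code unit: convert the byte position to a character position
      put tPos div 2 + tPos mod 2 & comma after tResult
   end repeat
   if tResult is empty then return 0
   else return char 1 to -2 of tResult
end allOffsets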
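
As mentioned, the same idea can be adapted to UTF32. The following is only a sketch: allOffsetsUTF32 is a hypothetical name, and it assumes that textEncode() with "UTF-32" produces four bytes per code point with no byte order mark in front. The alignment check is my addition; it drops matches that do not start on a code point boundary.

```livecode
-- Sketch of a UTF-32 variant; allOffsetsUTF32 is a hypothetical name.
-- Assumes textEncode() emits 4 bytes per code point and no byte order mark.
function allOffsetsUTF32 pDelim, pString, pCaseSensitive
   local tNewPos, tPos, tResult
   put textEncode(pDelim,"UTF-32") into pDelim
   put textEncode(pString,"UTF-32") into pString
   set the caseSensitive to pCaseSensitive is true
   put 0 into tPos
   repeat forever
      put offset(pDelim, pString, tPos) into tNewPos
      if tNewPos = 0 then exit repeat
      add tNewPos to tPos
      -- code point n starts at byte 4n - 3, so aligned matches have tPos mod 4 = 1
      if tPos mod 4 is not 1 then next repeat
      -- convert the byte position to a code point position
      put (tPos + 3) div 4 & comma after tResult
   end repeat
   if tResult is empty then return 0
   else return char 1 to -2 of tResult
end allOffsetsUTF32
```

With four bytes per code point, the positions returned are code point positions, so characters outside the Basic Multilingual Plane count as one position each.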
