How to find offsets in Unicode Text fast

Niggemann, Bernd Bernd.Niggemann at uni-wh.de
Sat Nov 10 08:12:44 EST 2018


This is a little late, but there was a discussion about how slow the plain offset() function is when the text contains Unicode characters.

Geoff Canyon and Brian Milby found a faster solution by setting the itemDelimiter to the search string.
They even provided a way to find matches that begin inside a preceding match (overlapping occurrences), which the offset() function finds by design.
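
For readers who have not followed that thread, the basic itemDelimiter idea can be sketched roughly as follows. This is not Geoff's or Brian's actual code: the name allOffsetsByItem is made up for illustration, case handling and overlapping matches are left out, and it assumes a LiveCode version that accepts a multi-character itemDelimiter.

```livecode
-- Hypothetical sketch; allOffsetsByItem is an illustrative name, not from the thread
function allOffsetsByItem pDelim, pString
   local tPos, tResult, tDelimLength, tStringLength
   
   set the itemDelimiter to pDelim
   put length(pDelim) into tDelimLength
   put length(pString) into tStringLength
   put 0 into tPos
   repeat for each item tItem in pString
      add length(tItem) to tPos
      -- tPos is now the position of the last character of this item;
      -- if that is before the end of the string, a delimiter follows it
      if tPos < tStringLength then put tPos + 1 & comma after tResult
      -- step over the delimiter itself
      add tDelimLength to tPos
   end repeat
   if tResult is empty then return 0
   return char 1 to -2 of tResult
end allOffsetsByItem
```

Because each delimiter is consumed whole, this sketch reports only non-overlapping matches: searching for "aa" in "aaa" should give 1, not 1,2.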

Here I propose a variant of the offset() approach that searches the UTF-16 encoded text; it is easily adaptable to UTF-32 if necessary.

To test (as in Brian's testStack), add a Unicode character to the text to be searched, e.g. at the end. Any non-ASCII character will do to show the speed penalty of plain offset(). I used ð (the Icelandic letter eth); any Chinese character works as well.


Kind regards
Bernd

-------------------------------------------
function allOffsets pDelim, pString, pCaseSensitive
   local tNewPos, tPos, tResult
   
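   -- encode both strings as UTF-16 so that offset() searches the raw bytes
   -- (two bytes per UTF-16 code unit)
   put textEncode(pDelim,"UTF16") into pDelim
   put textEncode(pString,"UTF16") into pString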
   
   set the caseSensitive to pCaseSensitive is true
   put 0 into tPos
   repeat forever
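      -- skip the first tPos bytes; the result is relative to that point
      put offset(pDelim, pString, tPos) into tNewPos
      if tNewPos = 0 then exit repeat
      add tNewPos to tPos
      -- a code unit always starts on an odd byte; a hit on an even byte
      -- straddles two code units and is not a real match, so skip it
      if tPos mod 2 = 0 then next repeat
      -- byte position to character position (1 -> 1, 3 -> 2, 5 -> 3, ...);
      -- the comma operator appends the trailing comma
      put tPos div 2 + tPos mod 2,"" after tResult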
   end repeat
   if tResult is empty then return 0
   else return char 1 to -2 of tResult
end allOffsets
-----------------------------------------
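
The test described above could be set up along these lines. This is only a sketch: the handler, the sample text and the search string are made up for illustration, and it assumes allOffsets is in the same script or further along the message path.

```livecode
-- Hypothetical test handler, e.g. in a button script
on mouseUp
   local tText, tStart, tOffsets, tElapsed
   
   -- build roughly 180,000 characters of plain ASCII by repeated doubling
   put "the quick brown fox jumps over the lazy dog " into tText
   repeat 12 times
      put tText after tText
   end repeat
   -- append one non-ASCII character: code point 240 is ð
   put numToCodepoint(240) after tText
   
   put the milliseconds into tStart
   put allOffsets("fox", tText, false) into tOffsets
   put the milliseconds - tStart into tElapsed
   
   -- report in the message box
   put (the number of items in tOffsets) && "matches in" && tElapsed && "ms"
end mouseUp
```

Timing the same search with a plain offset() loop on the unencoded text, with and without the appended character, should show the penalty discussed in the thread.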

