offset() functions and Unicode: SOLUTIONS

Slava Paperno slava.paperno at cornell.edu
Sun Jun 19 22:30:56 EDT 2011


I thought I would broadcast some good news for a change.

Good News 1) I expected the offset() function to be useless with bilingual
text (i.e. a mix of Roman and non-Roman, double-byte characters) for the
same reason as mouseCharChunk(), but no, it works fine. I guess
mouseCharChunk() hasn't been updated the way offset() has. If I am wrong,
please let us know.

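Bad News 1) wordOffset() works for some words, but fails if your text
contains an upper case Russian R (decimal 1056, bytes 32 and 4), and
probably other similarly confusing bytes. Byte 32 is also the code for a
space, so the engine sees a word break in the middle of the character.
Don't use wordOffset() for Unicode.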
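
A minimal sketch for checking this yourself. It assumes a 2011-era (pre-7)
engine on a little-endian machine; the variable names tEr and tTwoErs are
made up for illustration. It prints the two bytes stored for the Russian R
and then counts the words in a string of two such letters with no space
between them. One unbroken word should give a count of 1; a higher count
would confirm that byte 32 is being treated as a space.

```livecode
on mouseUp
   local tEr, tTwoErs
   -- bytes 208 and 160 are the UTF-8 form of code point 1056
   put uniEncode(numToChar(208) & numToChar(160), "UTF8") into tEr -- UTF-16
   -- show the numeric value of each of the two stored bytes
   put charToNum(char 1 of tEr) && charToNum(char 2 of tEr) into msg
   -- two letters in a row, with no real space between them
   put tEr & tEr into tTwoErs
   put return & the number of words of tTwoErs after msg
end mouseUp
```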

Good News 2) The "find" command works fine in bilingual texts, perhaps
because it matches a string and doesn't tell us where the match is. Even a
repeated "find" works fine and finds (and highlights) the next occurrence.

Good News 3) Although the replace command fails for bilingual fields, you
can work around the problem like so:

on mouseUp
   --this handler amounts to a home-made custom replace() for UTF-16 fields;
   
   --it searches field InputField for the first occurrence of word 1 from
   --field WordToFind and replaces it with word 2 from field WordToFind;
   
   --here is what we do:
   --1) text in field WordToFind is stored in a variable and converted to
   --   UTF-8
   --2) word 1 and word 2 are retrieved from that UTF-8 variable (they
   --   can't be reliably retrieved directly from the field)
   --3) text from InputField (the field to search in) is stored in a
   --   variable and converted to UTF-8
   --4) offset() is called to find the position of the search target in
   --   the UTF-8 input string
   --5) the final result is created by concatenating the text in the input
   --   string up to the offset & the replacement string & the text in the
   --   input string that follows the search target
   --6) this final result is converted to UTF-16 and displayed in the field
   
   local locInputText
   local locFindReplaceText
   local locStrToFind, locReplacementStr
   local locOffset   
   local locHead, locTail
   
   set caseSensitive to true
   
   put the unicodeText of field "WordToFind" of this card into \
         locFindReplaceText
   put uniDecode(locFindReplaceText, "UTF8") into locFindReplaceText
   --the words are now retrieved from UTF8; although word 1 would have been
   --retrieved successfully from the original UTF16 string, word 2 and
   --later words would not, especially if some of the double-byte
   --characters happened to be byteNum 32 followed by byteNum X, like the
   --Russian upper case R (decimal 1056)
   put word 1 of locFindReplaceText into locStrToFind
   put word 2 of locFindReplaceText into locReplacementStr
   
   --this direct approach will not work with Unicode:
   --   replace locStrToFind with locReplacementStr in \
   --         field "InputField" of this card
   
   --until LC engineers create a replaceUnicode command, use this approach:
   put the unicodeText of field "InputField" of this card into \
         locInputText --UTF16
   put uniDecode(locInputText, "UTF8") into locInputText --UTF8
   
   put offset(locStrToFind, locInputText) into locOffset
   
   if (locOffset is not an integer) or (locOffset is 0) then
      set the unicodeText of field "SearchResult" to \
            uniEncode("Your word 1 was not found.", "UTF8")
      exit mouseUp
   end if
   
   put char 1 to (locOffset - 1) of locInputText into locHead
   put char (locOffset + length(locStrToFind)) to -1 of locInputText into \
         locTail
   
   put locHead & locReplacementStr & locTail into locInputText --UTF8
   
   set the unicodeText of field "InputField" of this card to \
         uniEncode(locInputText, "UTF8")
end mouseUp
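
The handler above replaces only the first match, because offset() reports
only the first position. If every occurrence should be replaced, the same
offset() technique can be wrapped in a loop. What follows is a sketch under
assumptions, not a tested implementation: the function name replaceAllUTF8
and its parameter names are hypothetical, and all three arguments are
expected to be UTF-8 strings already, as locStrToFind, locReplacementStr
and locInputText are in the handler.

```livecode
function replaceAllUTF8 pFind, pReplacement, pText
   local tResult, tOffset
   -- nothing to search for: return the text unchanged
   if pFind is empty then return pText
   set the caseSensitive to true
   put offset(pFind, pText) into tOffset
   repeat while tOffset > 0
      -- keep the text before the match, then append the replacement
      put char 1 to (tOffset - 1) of pText & pReplacement after tResult
      -- drop the part already handled, including the match itself
      delete char 1 to (tOffset + length(pFind) - 1) of pText
      put offset(pFind, pText) into tOffset
   end repeat
   -- whatever is left contains no further matches
   return tResult & pText
end replaceAllUTF8
```

In the handler, the offset(), locHead and locTail steps could then be
swapped for a single call:

```livecode
put replaceAllUTF8(locStrToFind, locReplacementStr, locInputText) \
      into locInputText --UTF8
```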

Slava

More information about the use-livecode mailing list