Cyrillic input

ron barber runrevron at
Wed Jun 1 10:11:03 EDT 2011

Hi Malte,
This is a modified function that Ken, Richard (and maybe Jacque) had a
hand in some time ago.
It does essentially the same thing that Slava suggested but I offer it
as it has helped me.


function RawDataToUTF16 pData
   -- Examine the data to determine encoding:
   --   UTF8 has 0xEF 0xBB 0xBF
   --   UTF16BE has 0xFE 0xFF
   --   UTF16LE has 0xFF 0xFE
   switch
      case charToNum(byte 1 of pData) = 0
         -- No BOM, but a leading null byte suggests big-endian UTF-16
         put "UTF16BE" into tTextEncoding
         break
      case charToNum(byte 1 of pData) = 0xFE and charToNum(byte 2 of pData) = 0xFF
         delete byte 1 to 2 of pData
         put "UTF16BE" into tTextEncoding
         break
      case charToNum(byte 1 of pData) = 0xFF and charToNum(byte 2 of pData) = 0xFE
         delete byte 1 to 2 of pData
         put "UTF16LE" into tTextEncoding
         break
      case charToNum(byte 1 of pData) = 0xEF and charToNum(byte 2 of pData) = 0xBB and charToNum(byte 3 of pData) = 0xBF
         -- Strip the UTF-8 BOM, as in the UTF-16 cases above
         delete byte 1 to 3 of pData
         put "UTF8" into tTextEncoding
         break
      default
         -- No BOM found, so assume UTF-8
         put "UTF8" into tTextEncoding
   end switch
   if tTextEncoding begins with "UTF16" then
      -- Check byte order, swapping if needed:
      if the processor is "x86" then
         put "LE" into tHostByteOrder
      else
         put "BE" into tHostByteOrder
      end if
      if byte -2 to -1 of tTextEncoding <> tHostByteOrder then
         -- swapbytes is a separate helper, not shown in this post
         put swapbytes(pData) into pData
      end if
      -- Already utf16, so nothing more needs to be done:
      #put uniEncode(uniDecode(pData, utf16),16) into tFieldData
      put pData into tFieldData
   else
      -- Convert from utf8 to Rev's native utf16:
      put uniEncode(pData, "utf8") into tFieldData
   end if
   -- Normalize line endings. "Åv" is presumably a Japanese character whose
   -- UTF-16 bytes include 13, so it is swapped out for "**" while the
   -- returns are replaced, then restored afterwards.
   replace uniencode("Åv","Japanese") with "**" in tFieldData
   replace CRLF with cr in tFieldData
   replace numtochar(13) with cr in tFieldData  --affects japanese ?
   replace "**" with uniencode("Åv","Japanese") in tFieldData
   return tFieldData
end RawDataToUTF16
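
The function calls swapbytes, which as far as I know is not a built-in and is not included in this post. A minimal sketch of such a helper follows, assuming it only needs to exchange the two bytes of each 16-bit unit. The handler body, the variable names and the mouseUp usage are illustrative assumptions, not the original authors' code.

```livecode
-- Hypothetical helper: reverse the byte order of each 16-bit unit in pData.
-- A trailing odd byte, if any, is dropped.
function swapBytes pData
   local tSwapped, tLength, i
   put the number of bytes in pData into tLength
   -- Walk the data two bytes at a time, emitting each pair in reverse order
   repeat with i = 1 to tLength - 1 step 2
      put byte (i + 1) of pData & byte i of pData after tSwapped
   end repeat
   return tSwapped
end swapBytes

-- Hypothetical usage: read a file as raw bytes and show it in a field.
on mouseUp
   answer file "Choose a text file:"
   if it is empty then exit mouseUp
   -- binfile: avoids any line-ending translation on the way in
   put URL ("binfile:" & it) into tRawData
   set the unicodeText of field "MyField" to RawDataToUTF16(tRawData)
end mouseUp
```

Reading through binfile: rather than file: matters here, because the encoding check needs the bytes exactly as they are stored on disk.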

On Wed, Jun 1, 2011 at 10:56 PM, Slava Paperno <slava at> wrote:
> Malte,
> As I said, I'm discovering these things as I go--I hadn't even heard of LC
> until last month. I'm finding that work with Unicode in LC involves a lot of
> jumping through hoops, but so far I have been able to do everything I
> needed. So don't give up :)
> I am not sure why your stack doesn't "know" whether the text in your field
> is UTF-16 or plain ANSI, but here is what I do:
> When I read some text from a file into a variable, I assume that it is
> UTF-8. There is no harm in that. Even if it turns out to be plain English,
> it can still be treated that way.
> When I assign that text to a field, I always use
> set the unicodeText of field MyField to uniEncode(myVar, "UTF8")
> Now the text in the field is UTF-16. I check to see if the first two bytes
> are decimal 255 followed by decimal 254 (or the reverse, 254 followed by
> 255), and if they are, I delete them, because that's the byte order mark (BOM).
> I can read and edit the field using the system's multilanguage input, like
> the Russian keyboard in Windows. Russian and English can be typed in any
> combination, but it is still all UTF-16. Each letter and each punctuation
> mark is a two-byte sequence. If you call length(the unicodeText of field
> MyField) it will report twice the number of characters that you see in the
> field.
> So if I have to access character N in the field, I do this:
> set useUnicode to true
> put char N to char N+1 of field MyField into myChar
> answer charToNum(myChar)
> That will show you a decimal number, like 1072 if myChar is a lower case
> Cyrillic a, or an ASCII number if it is an English letter.
> Even plain English letters must be accessed like that, as two bytes. For
> English, the first byte is a null, and the second is the ASCII of the
> letter, but you don't need to think of that. Just treat every letter as a
> two-char sequence.
> If the user types in that field, what he types is in UTF-16.
> If I need to do anything with the text in the field, like store it to a
> file, I read it into a variable:
> put the unicodeText of field MyField into myVar2
> and immediately convert it to UTF-8:
> put uniDecode(myVar2, "UTF8") into myVar2
> Now myVar2 is UTF-8 and can be stored in a file or processed by scripts.
> There are apparently limitations to what you can do with Cyrillic in LC, but
> the things that I have listed all work for me.
> Slava
>> -----Original Message-----
>> From: use-livecode-bounces at [mailto:use-livecode-
>> bounces at] On Behalf Of Malte Brill
>> Sent: Wednesday, June 01, 2011 9:23 AM
>> To: use-livecode at
>> Subject: Re: Re: Cyrillic input
>> Thanks Mark and Slava!
>> Well, this is getting me a bit further. Now if only I could reliably check
>> whether the text in my field is regular ASCII or UTF encoded, that would
>> really make my day.
>> Cheers,
>> malte