Cyrillic input

ron barber runrevron at gmail.com
Wed Jun 1 10:11:03 EDT 2011


Hi Malte,
Here is a modified version of a function that Ken, Richard (and maybe
Jacque) had a hand in some time ago.
It does essentially what Slava suggested, but I offer it because it has
helped me.

Thanks
Ron

function RawDataToUTF16 pData
   -- Examine the data to determine encoding:
   --   UTF8 has 0xEF 0xBB 0xBF
   --   UTF16BE has 0xFE 0xFF
   --   UTF16LE has 0xFF 0xFE

   switch
      case charToNum(byte 1 of pData) = 0
         -- No BOM, but a leading null byte suggests big-endian UTF-16:
         put "UTF16BE" into tTextEncoding
         break
      case charToNum(byte 1 of pData) = 0xFE and charToNum(byte 2 of pData) = 0xFF
         delete byte 1 to 2 of pData
         put "UTF16BE" into tTextEncoding
         break
      case charToNum(byte 1 of pData) = 0xFF and charToNum(byte 2 of pData) = 0xFE
         delete byte 1 to 2 of pData
         put "UTF16LE" into tTextEncoding
         break
      case byte 1 to 3 of pData is numToChar(239) & numToChar(187) & numToChar(191)
         -- UTF-8 BOM (0xEF 0xBB 0xBF); strip it like the UTF-16 BOMs above:
         delete byte 1 to 3 of pData
         put "UTF8" into tTextEncoding
         break
      default
         put "UTF8" into tTextEncoding
         break
   end switch
   --
   if tTextEncoding begins with "UTF16" then
      -- Check byte order, swapping if needed:
      if the processor is "x86" then
         put "LE" into tHostByteOrder
      else
         put "BE" into tHostByteOrder
      end if
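      if char -2 to -1 of tTextEncoding <> tHostByteOrder then
         -- swapbytes() is not shown in this post (presumably a custom
         -- handler); it needs to exchange the two bytes of each pair:
         put swapbytes(pData) into pData
      end if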
      -- Already utf16, so nothing more needs to be done:
      #put uniEncode(uniDecode(pData, utf16),16) into tFieldData
      put pData into tFieldData
   else
      put uniEncode(pData, "utf8") into tFieldData
   end if
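   -- Normalize line endings. "Åv" appears to be the Shift-JIS bytes
   -- 0x81 0x76 (the bracket U+300D) shown as MacRoman; its UTF-16 form
   -- contains a 0x0D byte, so it is swapped out before CRs are replaced:
   replace uniEncode("Åv","Japanese") with "**" in tFieldData
   replace CRLF with cr in tFieldData
   replace numToChar(13) with cr in tFieldData  -- affects Japanese?
   replace "**" with uniEncode("Åv","Japanese") in tFieldData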
   return tFieldData
end RawDataToUTF16
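
For anyone who wants to cross-check the detection logic outside LiveCode,
here is a rough Python sketch of the same idea: look for a BOM, fall back to
the leading-null heuristic, and otherwise assume UTF-8. The names
sniff_encoding and raw_data_to_text are made up for this illustration and are
not part of the function above.

```python
import codecs


def sniff_encoding(data: bytes):
    """Return (encoding_name, payload_without_bom) for raw file bytes."""
    if data.startswith(codecs.BOM_UTF8):        # 0xEF 0xBB 0xBF
        return "utf-8", data[len(codecs.BOM_UTF8):]
    if data.startswith(codecs.BOM_UTF16_BE):    # 0xFE 0xFF
        return "utf-16-be", data[2:]
    if data.startswith(codecs.BOM_UTF16_LE):    # 0xFF 0xFE
        return "utf-16-le", data[2:]
    if data[:1] == b"\x00":
        # No BOM, but a leading null byte suggests big-endian UTF-16.
        return "utf-16-be", data
    # Default assumption, as in the LiveCode function: treat it as UTF-8.
    return "utf-8", data


def raw_data_to_text(data: bytes) -> str:
    """Decode raw bytes, then normalize CRLF and bare CR to LF."""
    encoding, payload = sniff_encoding(data)
    text = payload.decode(encoding)
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

Because this sketch decodes first and only then normalizes line endings, a
character such as U+300D, whose UTF-16 form contains a 0x0D byte, needs no
special protection here.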


On Wed, Jun 1, 2011 at 10:56 PM, Slava Paperno <slava at lexiconbridge.com> wrote:
> Malte,
>
> As I said, I'm discovering these things as I go--I hadn't even heard of LC
> until last month. I'm finding that work with Unicode in LC involves a lot of
> jumping through hoops, but so far I have been able to do everything I
> needed. So don't give up :)
>
> I am not sure why your stack doesn't "know" whether the text in your field
> is UTF-16 or plain ANSI, but here is what I do:
>
> When I read some text from a file into a variable, I assume that it is
> UTF-8. There is no harm in that. Even if it turns out to be plain English,
> it can still be treated that way.
>
> When I assign that text to a field, I always use
>
> set the unicodeText of field MyField to uniEncode(myVar, "UTF8")
>
> Now the text in the field is UTF-16. I check to see if the first two bytes
> are decimal 255 followed by decimal 254 (or the reverse, 254 followed by
> 255), and if they are, I delete them, because that's the BOM.
>
> I can read and edit the field using the system's multilanguage input, like
> the Russian keyboard in Windows. Russian and English can be typed in any
> combination, but it is still all UTF-16. Each letter and each punctuation
> mark is a two-byte sequence. If you call length(the unicodeText of field
> MyField) it will report twice the number of characters that you see in the
> field.
>
> So if I have to access character N in the field, I do this:
>
> set useUnicode to true
> put char N to char N+1 of field MyField into myChar
> answer charToNum(myChar)
> That will show you a decimal number, like 1072 if myChar is a lower case
> Cyrillic a or an ASCII number if it is an English letter.
>
> Even plain English letters must be accessed like that, as two bytes. For
> English, the first byte is a null, and the second is the ASCII of the
> letter, but you don't need to think of that. Just treat every letter as a
> two-char sequence.
>
> If the user types in that field, what he types is in UTF-16.
>
> If I need to do anything with the text in the field, like store it to a
> file, I read it into a variable:
>
> put the unicodeText of field MyField into myVar2
>
> and immediately convert it to UTF-8:
>
> put uniDecode(myVar2, "UTF8") into myVar2
>
> Now myVar2 is UTF-8 and can be stored in a file or processed by scripts.
>
> There are apparently limitations to what you can do with Cyrillic in LC, but
> the things that I have listed all work for me.
>
> Slava
>
>> -----Original Message-----
>> From: use-livecode-bounces at lists.runrev.com [mailto:use-livecode-
>> bounces at lists.runrev.com] On Behalf Of Malte Brill
>> Sent: Wednesday, June 01, 2011 9:23 AM
>> To: use-livecode at lists.runrev.com
>> Subject: Re: Re: Cyrillic input
>>
>> Thanks mark and Slava!
>>
>> well, this is getting me a bit further. Now if only I knew if I could
>> reliably check if the text in my field is regular ASCII or UTF encoded,
>> that would really make my day.
>>
>> Cheers,
>>
>> malte