Cyrillic input

Slava Paperno slava at lexiconbridge.com
Wed Jun 1 10:42:51 EDT 2011


Brilliant! Thanks, Ron. Very educational for me.

By the way, Malte, there was a bug in my post from half an hour ago. To
convert from the UTF-16 in the field back to UTF-8 for storing or
manipulating, use this:

uniDecode(myVar, "UTF8")

I had "UTF16" there, and Dave C. pointed it out in a separate (related)
thread--which saved my day.
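
For reference, the full round trip might look like the minimal sketch below.
It assumes a field named "MyField" on the current card and a parameter myVar
that holds UTF-8 text; the handler name and both of those names are
placeholders, not anything from the earlier posts.

```livecode
-- Minimal round-trip sketch. roundTripExample, "MyField" and myVar are
-- placeholder names chosen for illustration.
on roundTripExample myVar
   -- UTF-8 in the variable -> UTF-16 in the field
   set the unicodeText of field "MyField" to uniEncode(myVar, "UTF8")

   -- UTF-16 in the field -> UTF-8 back in a variable
   put uniDecode(the unicodeText of field "MyField", "UTF8") into tRoundTrip

   -- For valid UTF-8 input the two values should match
   answer (tRoundTrip is myVar)
end roundTripExample
```

The second argument names the non-UTF-16 side of the conversion in both
directions, which is why it is "UTF8" for uniEncode and for uniDecode alike.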

Slava

> -----Original Message-----
> From: use-livecode-bounces at lists.runrev.com [mailto:use-livecode-
> bounces at lists.runrev.com] On Behalf Of ron barber
> Sent: Wednesday, June 01, 2011 10:11 AM
> To: How to use LiveCode
> Subject: Re: Cyrillic input
> 
> Hi malte,
> This is a modified function that Ken, Richard (and maybe Jacque) had a hand
> in some time ago.
> It does essentially the same thing that Slava suggested but I offer it as it
> has helped me.
> 
> Thanks
> Ron
> 
> function RawDataToUTF16 pData
>    -- Examine the data to determine encoding:
>    --   UTF8 has 0xEF 0xBB 0xBF
>    --   UTF16BE has 0xFE 0xFF
>    --   UTF16LE has 0xFF 0xFE
> 
>    switch
>       case charToNum(byte 1 of pData) = 0
>          -- No BOM, but a leading null byte suggests big-endian UTF-16
>          put "UTF16BE" into tTextEncoding
>          break
>       case charToNum(byte 1 of pData) = 0xFE and \
>             charToNum(byte 2 of pData) = 0xFF
>          delete byte 1 to 2 of pData
>          put "UTF16BE" into tTextEncoding
>          break
>       case charToNum(byte 1 of pData) = 0xFF and \
>             charToNum(byte 2 of pData) = 0xFE
>          delete byte 1 to 2 of pData
>          put "UTF16LE" into tTextEncoding
>          break
>       -- UTF-8 BOM, tested byte by byte rather than as a literal string
>       case charToNum(byte 1 of pData) = 0xEF and \
>             charToNum(byte 2 of pData) = 0xBB and \
>             charToNum(byte 3 of pData) = 0xBF
>          put "UTF8" into tTextEncoding
>          break
> 
>       default
>          put "UTF8" into tTextEncoding
>          break
>    end switch
>    --
>    if tTextEncoding begins with "UTF16" then
>       -- Check byte order, swapping if needed:
>       if the processor is "x86" then
>          put "LE" into tHostByteOrder
>       else
>          put "BE" into tHostByteOrder
>       end if
>       if byte -2 to -1 of tTextEncoding <> tHostByteOrder then
>          put swapbytes(pData) into pData
>       end if
>       -- Already utf16, so nothing more needs to be done:
>       #put uniEncode(uniDecode(pData, utf16),16) into tFieldData
>       put pData into tFieldData
>     else
>       put uniEncode(pData, "utf8") into tFieldData
>    end if
>    -- Normalize line endings to cr. The "Åv" placeholder swap appears to
>    -- protect a Japanese character whose UTF-16 bytes include 13 (CR):
>    replace uniencode("Åv","Japanese") with "**" in tFieldData
>    replace CRLF with cr in tFieldData
>    replace numtochar(13) with cr in tfieldData  --affects japanese ?
>    replace "**"  with uniencode("Åv","Japanese")  in  tFieldData
>    return tFieldData
> end RawDataToUTF16
> 
> 
> On Wed, Jun 1, 2011 at 10:56 PM, Slava  wrote:
> > Malte,
> >
> > As I said, I'm discovering these things as I go--I hadn't even heard
> > of LC until last month. I'm finding that work with Unicode in LC
> > involves a lot of jumping through hoops, but so far I have been able
> > to do everything I needed. So don't give up :)
> >
> > I am not sure why your stack doesn't "know" whether the text in your
> > field is UTF-16 or plain ANSI, but here is what I do:
> >
> > When I read some text from a file into a variable, I assume that it is
> > UTF-8. There is no harm in that. Even if it turns out to be plain
> > English, it can still be treated that way.
> >
> > When I assign that text to a field, I always use
> >
> > set the unicodeText of field MyField to uniEncode(myVar, "UTF8")
> >
> > Now the text in the field is UTF-16. I check to see if the first two
> > bytes are decimal 255 followed by decimal 254 (or the reverse, 254
> > followed by 255), and if they are, I delete them, because that's BOM.
> >
> > I can read and edit the field using the system's multilanguage input,
> > like the Russian keyboard in Windows. Russian and English can be typed
> > in any combination, but it is still all UTF-16. Each letter and each
> > punctuation mark is a two-byte sequence. If you call
> > length(the unicodeText of field MyField) it will report twice the number
> > of characters that you see in the field.
> >
> > So if I have to access character N in the field, I do this:
> >
> > set useUnicode to true
> > put char N to char N+1 of field MyField into myChar
> > answer charToNum(myChar)
> >
> > That will show you a decimal number, like 1072 if myChar is a lower case
> > Cyrillic a, or an ASCII number if it is an English letter.
> >
> > Even plain English letters must be accessed like that, as two bytes.
> > For English, the first byte is a null, and the second is the ASCII of
> > the letter, but you don't need to think of that. Just treat every
> > letter as a two-char sequence.
> >
> > If the user types in that field, what he types is in UTF-16.
> >
> > If I need to do anything with the text in the field, like store it to
> > a file, I read it into a variable:
> >
> > put the unicodeText of field MyField into myVar2
> >
> > and immediately convert it to UTF-8:
> >
> > put uniDecode(myVar2, "UTF16") into myVar2
> > ==> CORRECTION: should be uniDecode(myVar2, "UTF8")
> >
> > Now myVar2 is UTF-8 and can be stored in a file or processed by scripts.
> >
> > There are apparently limitations to what you can do with Cyrillic in
> > LC, but the things that I have listed all work for me.
> >
> > Slava
> >
> >> -----Original Message-----
> >> From: use-livecode-bounces at lists.runrev.com [mailto:use-livecode-
> >> bounces at lists.runrev.com] On Behalf Of Malte Brill
> >> Sent: Wednesday, June 01, 2011 9:23 AM
> >> To: use-livecode at lists.runrev.com
> >> Subject: Re: Re: Cyrillic input
> >>
> >> Thanks mark and Slava!
> >>
> >> well, this is getting me a bit further. Now if only I knew if I could
> >> reliably check if the text in my field is regular ASCII or UTF encoded,
> >> that would really make my day.
> >>
> >> Cheers,
> >>
> >> malte
> >>
> >
> >
> >
> > _______________________________________________
> > use-livecode mailing list
> > use-livecode at lists.runrev.com
> > Please visit this url to subscribe, unsubscribe and manage your
> > subscription preferences:
> > http://lists.runrev.com/mailman/listinfo/use-livecode
> >
> 