Why does one char in UTF8 (3 bytes) become 6 bytes when converted to UTF16?

Kee Nethery kee at kagi.com
Tue Mar 29 21:30:16 EDT 2011


I have the "don't" sign symbol (COMBINING ENCLOSING CIRCLE BACKSLASH, U+20E0) in a text file that I read into LiveCode. For grins, it's the character between "Petro" and "Max" below.

Petro⃠Max

When I scan the bytes, in UTF8 this is encoded as 226 131 160, also known as E2 83 A0. That is the correct UTF8 encoding for this character.
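
(For reference, a quick sketch in Python, used here only because it makes the byte math easy to show; it confirms that E2 83 A0 is the UTF8 form of this character.)

# Python 3 sketch (outside LiveCode) confirming the UTF8 byte sequence
# for U+20E0 COMBINING ENCLOSING CIRCLE BACKSLASH.
char = "\u20e0"
utf8_bytes = char.encode("utf-8")
print(list(utf8_bytes))                          # [226, 131, 160]
print(" ".join(f"{b:02X}" for b in utf8_bytes))  # E2 83 A0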

When I convert this to UTF16 using uniEncode(theUtf8Text) or uniEncode(theUtf8Text,"UTF16"), the byte values are 26 32 201 0 32 32.

A Unicode character in UTF16 should be stored as either two or four bytes, never six. According to the Unicode spec, the characters that require 4 bytes are pretty uncommon, and I'm willing to ignore the error they will create if the data stream ever contains them. But the thing I'm trying to do is count characters on a line, and my single character looks like three when converted to UTF16.
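
My guess, and it is only a guess, is that when uniEncode isn't told the source is UTF8, it treats each byte as a separate character in the platform's native encoding (Mac Roman on my machine) and converts each one to UTF16 on its own, giving three 2-byte code units. The arithmetic matches exactly; sketched in Python:

# Treat the three UTF8 bytes as three separate Mac Roman characters,
# then convert each of them to UTF16 (little-endian).
utf8_bytes = bytes([226, 131, 160])            # E2 83 A0
as_mac_roman = utf8_bytes.decode("mac_roman")  # three chars: U+201A, U+00C9, U+2020
utf16_bytes = as_mac_roman.encode("utf-16-le")
print(list(utf16_bytes))                       # [26, 32, 201, 0, 32, 32]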

Any suggestions on how to get a UTF8 character to correctly convert to UTF16?
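
What I'm expecting is the normal round trip, decode the bytes as UTF8 and then encode the single resulting character as UTF16, which gives one 2-byte code unit (sketched in Python below). In LiveCode I assume that means telling uniEncode the source encoding, something like uniEncode(theUtf8Text, "UTF8"), though I haven't confirmed that parameter value.

# The expected conversion: interpret the bytes as UTF8 first,
# then encode the single resulting character as UTF16 (little-endian).
utf8_bytes = bytes([226, 131, 160])  # E2 83 A0
char = utf8_bytes.decode("utf-8")    # one character, U+20E0
utf16_bytes = char.encode("utf-16-le")
print(list(utf16_bytes))             # [224, 32] -- one 2-byte code unit

Counting UTF16 code units (bytes div 2) would then give the character count I'm after, at least for text that stays in the Basic Multilingual Plane.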

Kee Nethery


