Character Encodings and LiveCode fields

Graham Samuel livfoss at mac.com
Sun Jan 26 13:29:26 EST 2014


Thanks Fraser, I had just started to understand this myself by browsing the actual Unicode Standard (6.2). The obvious objection that a bit-twiddler like me would make - "how do you know how many bytes are being used to represent a particular character in UTF-8?" - is answered by the fact (as I understand it, looking at it for the first time) that the top bits of the first byte of any sequence act as a flag saying how many bytes belong to the sequence. A top bit of 0 means "this byte is the whole character", while top bits of 110, 1110 or 11110 mean "this is the first byte of a 2-, 3- or 4-byte sequence". The subsequent bytes (if any) all start with the bits 10, which marks them as continuation bytes rather than the start of a new character. This is clearly a bit tricky to interpret, and at first glance it looks fairly easy to get lost (for example if a byte gets missed from the sequence), although since a continuation byte can never be mistaken for a first byte, a decoder should at least be able to find the start of the next character. In any case it explains how you can get a variable number of bytes in the encoding.
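
To convince myself, here is a minimal sketch in Python (not LiveCode, and nothing to do with how the engine does it internally) of reading the top bits of a first byte to find the length of a UTF-8 sequence. The function names utf8_sequence_length and split_utf8 are my own inventions for illustration, and the sketch assumes well-formed input rather than validating the continuation bytes.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return the number of bytes in the UTF-8 sequence that starts with lead_byte."""
    if (lead_byte & 0b10000000) == 0b00000000:   # 0xxxxxxx: single byte
        return 1
    if (lead_byte & 0b11100000) == 0b11000000:   # 110xxxxx: 2-byte sequence
        return 2
    if (lead_byte & 0b11110000) == 0b11100000:   # 1110xxxx: 3-byte sequence
        return 3
    if (lead_byte & 0b11111000) == 0b11110000:   # 11110xxx: 4-byte sequence
        return 4
    # 10xxxxxx is a continuation byte; anything else is not valid as a first byte.
    raise ValueError("byte cannot start a UTF-8 sequence")


def split_utf8(data: bytes) -> list:
    """Split well-formed UTF-8 data into one bytes object per character."""
    chunks = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks


if __name__ == "__main__":
    # "A" is 1 byte and the euro sign (U+20AC) is 3 bytes in UTF-8.
    print(split_utf8("A\u20ac".encode("utf-8")))
```

A real decoder would also check that each following byte really does start with 10 and that the sequence is not cut short.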

Light dawns very slowly. I am glad I am not writing a Unicode word processor. I am still very far from understanding how LC goes about handling Unicode, and how 7.x will differ from the 6.5.x we have now, and how, if I put something in UTF-8 onto the clipboard, LC will be able to transform it into UTF-16.
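
On the UTF-8 to UTF-16 question, the general idea (as far as I can tell) is to decode the UTF-8 bytes back to code points and then re-encode those code points as UTF-16. The sketch below shows that in Python using its built-in codecs. It says nothing about what LC actually does with the clipboard; the function name utf8_to_utf16 is hypothetical, and the choice of little-endian UTF-16 with no byte order mark is purely my assumption.

```python
def utf8_to_utf16(data: bytes) -> bytes:
    """Transcode UTF-8 bytes to UTF-16 (little-endian, no BOM: an assumption)."""
    text = data.decode("utf-8")        # UTF-8 bytes -> code points
    return text.encode("utf-16-le")    # code points -> UTF-16 code units


if __name__ == "__main__":
    # U+00E9 is 2 bytes in UTF-8 and a single 16-bit unit in UTF-16.
    print(utf8_to_utf16(b"\xc3\xa9").hex())
    # U+1F600 is 4 bytes in UTF-8 and a surrogate pair (two 16-bit units) in UTF-16.
    print(utf8_to_utf16(b"\xf0\x9f\x98\x80").hex())
```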

There's a lot to learn.

Graham


On 26 Jan 2014, at 19:02, Fraser Gordon <fraser.gordon at runrev.com> wrote:

> On 26/01/2014 17:31, Richmond wrote:
>> I'm not sure that ALL Unicode chars are double-byte ones; possibly the
>> first 255 are not.
> It depends on the encoding. In UTF-16 encoding, all characters are
> either 2 bytes or 4 bytes. In UTF-8, they can be 1 (for the first 128
> characters), 2, 3 or 4 bytes long (depending on the character). LiveCode
> 6.x uses UTF-16 and should consequently have 2 byte unicode characters.
> 
> Regards,
> Fraser
> 
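
The byte counts Fraser gives above are easy to check. A small Python sketch, with encoded_lengths being a made-up helper name, compares the encoded size of a few sample characters in UTF-8 and UTF-16 (little-endian, so that no byte order mark is counted):

```python
def encoded_lengths(ch: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 byte count) for a single character."""
    return (len(ch.encode("utf-8")), len(ch.encode("utf-16-le")))


if __name__ == "__main__":
    # Sample characters: U+0041, U+00E9, U+20AC and U+1F600.
    for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
        print("U+%04X" % ord(ch), encoded_lengths(ch))
```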




