Unicode and chunk expressions
Richard Gaskin
ambassador at fourthworld.com
Tue May 17 18:58:25 EDT 2005
Dar Scott wrote:
>
> On May 17, 2005, at 3:25 PM, Richard Gaskin wrote:
>
>> Forgive my ignorance, but how can UTF8 be used with two-byte systems
>> like Chinese? I was under the impression those had to be UTF16.
>
>
> Unicode is universal in that characters from many languages, language
> families, and special-use domains are all mapped onto the same numerical
> space. Unless you need to import or export files in some particular
> encoding format, you don't need specialized encoding methods.
>
> Each Unicode character fits in 32 bits (code points run from 0 to hex
> 10FFFF). Almost all the ones you are likely to use are in the lower 16.
> The number associated with the character (not the glyph) is the code
> point.
>
> The representation of a sequence of characters in a sequence of 32-bit,
> 16-bit or 8 bit values is called an encoding form. It does not lose
> information. It just packs it. The encoding form is what you would
> consider when working in the computer. Those encoding forms are UTF-32,
> UTF-16 and UTF-8. Note that the byte order within each value is not
> specified.
>
> However, those byte-orders have to be specified if these are viewed as
> bytes (or Transcript chars) or you are writing to a file. There you
> need UTF-32BE (big endian), UTF-32LE, UTF-16BE, UTF-16LE and UTF-8. The
> order is not needed on UTF-8. These are called encoding schemes.
>
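To make the scheme names concrete, here is a small sketch in Python (not Transcript, chosen only because its standard codecs carry the same names). The sample string is my own choice: an ASCII "A" followed by the Chinese character U+4E2D. It shows the same two characters laid out under each encoding scheme, and that every one of them decodes back without loss.

```python
# Sample text (an assumption for illustration): 'A' plus U+4E2D.
text = "A\u4e2d"

schemes = ["utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"]

# Encode the same characters under each encoding scheme.
encoded = {name: text.encode(name) for name in schemes}

for name, data in encoded.items():
    # Print the bytes in hex so the byte order is visible.
    print(f"{name:10} {data.hex(' ')}")

# No scheme loses information: each decodes back to the same text.
assert all(data.decode(name) == text for name, data in encoded.items())
```

The UTF-8 line has no BE or LE variant because its unit is a single byte; the Chinese character simply takes three of them (e4 b8 ad).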
> All Unicode characters can be packed into UTF-8 or UTF-16.
>
> For UTF-16, the only complication is the rare (and reasonably ignored)
> characters outside the BMP (Basic Multilingual Plane), the basic range.
> Those are handled by special pairs of 16-bit values, called surrogate
> pairs.
>
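For the curious, the "special double values" can be worked out by hand. The following Python sketch follows the standard UTF-16 surrogate arithmetic; the function name to_surrogate_pair is hypothetical, and the sample character (U+1D11E, the musical G clef) is just a convenient one outside the BMP.

```python
def to_surrogate_pair(code_point):
    """Split a code point above the BMP into two 16-bit UTF-16 values.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000          # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x1D11E)
print(hex(high), hex(low))

# Cross-check against Python's own UTF-16 encoder.
expected = "\U0001D11E".encode("utf-16-be")
assert expected == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```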
> For UTF-8 the encoding is very clever. All characters in the ASCII code
> range (7 bits) are represented by bytes with the high bit zero. All
> others are represented by a sequence of bytes of which the high two bits
> are 11 for the first byte of the sequence and all the others are 10.
> Also it is possible to determine the number of bytes for that character
> from the first byte. You can read this backwards, too, so if Transcript
> goes to UTF-8, you can get char -1.
>
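Both properties (length from the first byte, and reading backwards) are easy to sketch. This is Python rather than Transcript, and the names sequence_length and last_char are made up for the example; it assumes the input is valid UTF-8.

```python
def sequence_length(lead_byte):
    """Number of bytes in the UTF-8 sequence that starts with lead_byte."""
    if lead_byte < 0x80:             # 0xxxxxxx: plain ASCII
        return 1
    if lead_byte >> 5 == 0b110:      # 110xxxxx
        return 2
    if lead_byte >> 4 == 0b1110:     # 1110xxxx
        return 3
    if lead_byte >> 3 == 0b11110:    # 11110xxx
        return 4
    raise ValueError("not a valid first byte (continuation or invalid)")

def last_char(data):
    """Return the last character of valid UTF-8 bytes by scanning backwards."""
    i = len(data) - 1
    # Step back over continuation bytes, which all look like 10xxxxxx.
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return data[i:].decode("utf-8")

data = "abc\u4e2d".encode("utf-8")
print(sequence_length(data[3]))   # length of the sequence starting at byte 3
print(last_char(data))            # the rough equivalent of "char -1"
```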
> Since all the characters outside the ASCII range are represented by two
> to four bytes, each with the high bit set, you can never get a false lf
> or space or comma. Also, '=' only considers ASCII letters when ignoring
> case, so you never get any false letter-case conversions in a
> comparison. "is a number" works with the usual Transcript numerals.
> UTF-8 has no nulls if there is no null character, so you can use it as
> a key to an array.
>
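A quick way to convince yourself of the "no false comma" point is to split raw UTF-8 bytes on the comma byte and check that nothing breaks. This Python sketch uses a made-up list of items; it stands in for what item-delimited chunk handling would do on bytes, and is not a claim about how Transcript works internally.

```python
# Sample items (assumed for illustration), two of them non-ASCII.
items = ["\u4e2d\u6587", "na\u00efve", "plain"]

# Join with commas, then treat the result as raw bytes.
data = ",".join(items).encode("utf-8")

# Split on the single comma byte, knowing nothing about Unicode.
pieces = data.split(b",")
decoded = [piece.decode("utf-8") for piece in pieces]
assert decoded == items

# Every byte that belongs to a non-ASCII character has the high bit set,
# so none of them can be mistaken for comma, space, tab or lf.
non_ascii_bytes = "\u4e2d\u6587\u00ef".encode("utf-8")
assert all(b >= 0x80 for b in non_ascii_bytes)

# No null bytes appear unless the text contains a null character.
assert 0 not in data
```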
> There may be ways folks will fool you by putting a dot over a comma or
> space (if possible), but usually the comma and space work just the way
> you expect. Oh, I forgot to say that tab and lf are part of the ASCII
> range.
>
> I don't know how word thinks about characters with the high bit set, but
> I bet it thinks those are just more characters outside of white space,
> so those should work in words, even if they use some special codes that
> are special spaces.
>
> I would expect the compiler is the same way, so a special editor can
> compile unicode string constants into UTF-8.
>
> UTF-8 is a "language" in uniDecode and uniEncode, so you can convert
> easily.
>
> Note that when I mention UTF-16, the normal form we get from "the
> unicodeText", I always emphasize "host-order", though that is redundant
> in a sense. The order depends on the OS. Because we can access those
> one byte at a time, we must then know that one is UTF-16BE and another
> might be UTF-16LE.
>
> I think it is handicapping to think of "wide characters" or "two-byte
> systems".
>
> Dar
>
Damn fine post, Dar. Thanks for that background.
Have you by chance made a nifty tutorial on Unicode like the ultra-cool
one you did about messages?
--
Richard Gaskin
Fourth World Media Corporation
___________________________________________________________
Ambassador at FourthWorld.com http://www.FourthWorld.com