Unicode and chunk expressions
Dar Scott
dsc at swcp.com
Tue May 17 18:51:01 EDT 2005
On May 17, 2005, at 3:25 PM, Richard Gaskin wrote:
> Forgive my ignorance, but how can UTF8 be used with two-byte systems
> like Chinese? I was under the impression those had to be UTF16.
Unicode is universal in that characters from many languages, language
families, and special-use domains are all mapped onto the same numerical
space. Unless you need to import or export files in some particular
encoding format, you don't need specialized encoding methods.
Each Unicode character fits in a 32-bit value. Almost all the ones you
are likely to use are in the lower 16 bits. The number associated with
the character (not the glyph) is the code point.
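The post is about Transcript, but the code-point idea is language-neutral; a small Python sketch of the distinction:

```python
# A code point is the number Unicode assigns to a character, independent
# of the glyph (how it is drawn) and of how it is later packed into bytes.
print(hex(ord("A")))    # 0x41, in the ASCII range
print(hex(ord("中")))   # 0x4e2d, still within the lower 16 bits
print(chr(0x4E2D))      # maps the number back to the character
```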
The representation of a sequence of characters as a sequence of 32-bit,
16-bit or 8-bit values is called an encoding form. It does not lose
information; it just packs it. The encoding form is what you would
consider when working inside the computer. Those encoding forms are
UTF-32, UTF-16 and UTF-8. Note that the byte order within each value
is not specified.
However, those byte orders have to be specified if the values are viewed
as bytes (or Transcript chars) or you are writing to a file. There you
need UTF-32BE (big-endian), UTF-32LE, UTF-16BE, UTF-16LE or UTF-8. No
byte order is needed for UTF-8, since its values are single bytes.
These are called encoding schemes.
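The form-versus-scheme distinction shows up as soon as you serialize; a Python sketch (using Python's standard codec names for these schemes):

```python
# The same two characters serialized under different encoding schemes.
# Only at this byte level does endianness matter; UTF-8 has no variants.
s = "A\u4e2d"  # "A" plus the character U+4E2D
print(s.encode("utf-16-be").hex())  # 00414e2d  (big-endian 16-bit values)
print(s.encode("utf-16-le").hex())  # 41002d4e  (same values, bytes swapped)
print(s.encode("utf-8").hex())      # 41e4b8ad  (byte order not an issue)
```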
All Unicode characters can be packed into either UTF-8 or UTF-16.
For UTF-16, only the rare (and often reasonably ignored) characters
outside the BMP (the Basic Multilingual Plane) need special handling.
Those are represented by pairs of 16-bit values called surrogate pairs.
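A Python sketch of such a double value, using U+10400, a character outside the BMP:

```python
# U+10400 (DESERET CAPITAL LETTER LONG I) lies outside the BMP, so in
# UTF-16 it becomes two 16-bit values: a surrogate pair D801, DC00.
c = "\U00010400"
b = c.encode("utf-16-be")
print(len(b))    # 4 -- two 16-bit values for one character
print(b.hex())   # d801dc00
print(len("A".encode("utf-16-be")))  # 2 -- a BMP character needs just one
```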
For UTF-8 the encoding is very clever. All characters in the ASCII
range (7 bits) are represented by single bytes with the high bit zero.
All others are represented by a sequence of bytes in which the high two
bits are 11 for the first byte of the sequence and 10 for all the
others. Also, the number of bytes for that character can be determined
from the first byte. You can read this backwards, too, so if Transcript
goes to UTF-8, you can get char -1.
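A sketch of that backwards read in Python (the helper name is mine, not anything from Transcript):

```python
def last_char(utf8_bytes: bytes) -> str:
    """Return the final character of a UTF-8 byte string by scanning
    backwards: continuation bytes look like 10xxxxxx, so skip them
    until hitting a byte whose high two bits are not 10 -- that byte
    starts the last character."""
    i = len(utf8_bytes) - 1
    while i > 0 and (utf8_bytes[i] & 0xC0) == 0x80:
        i -= 1
    return utf8_bytes[i:].decode("utf-8")

print(last_char("abc\u4e2d".encode("utf-8")))  # the 3-byte character U+4E2D
print(last_char(b"plain ascii"))               # "i"
```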
Since all the characters outside the ASCII range are represented by two
to four bytes, each with the high bit set, you can never get a false lf
or space or comma. Also, '=' only ignores case for ASCII letters, so
you never get any false case conversions in comparison. "is a number"
works with the usual Transcript numerals. UTF-8 contains no null bytes
unless the text contains the null character, so you can use it as a key
to an array.
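That no-false-delimiter property is what makes byte-level chunking safe; a Python sketch:

```python
# Splitting UTF-8 bytes on an ASCII comma can never cut a character in
# half: every byte of a multi-byte sequence has its high bit set, so
# 0x2C (",") only ever appears as a real comma.
line = "\u65e5\u672c,\u4e2d\u56fd".encode("utf-8")   # "Japan,China" in CJK
items = [x.decode("utf-8") for x in line.split(b",")]
print(items)
```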
There may be ways folks will fool you by putting a dot over a comma or
space (if possible), but usually the comma and space work just the way
you expect. Oh, I forgot to say that tab and lf are part of the ASCII
range.
I don't know how the word chunk treats characters with the high bit
set, but I bet it treats them as just more characters outside of
whitespace, so those should work in words, even if the text uses some
special codes that are special spaces.
I would expect the compiler is the same way, so a special editor can
compile unicode string constants into UTF-8.
UTF-8 is a "language" in uniDecode and uniEncode, so you can convert
easily.
Note that when I mention UTF-16, the normal form we get from "the
unicodeText", I always emphasize "host order", though that is redundant
in a sense. The order depends on the OS. Because we can access those
values one byte at a time, we must know that on one platform they are
UTF-16BE and on another they might be UTF-16LE.
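A Python sketch of why host order matters once you look at the bytes:

```python
import sys

# The 16-bit value 0x0041 ("A") as bytes, under each explicit scheme:
print("A".encode("utf-16-be").hex())  # 0041
print("A".encode("utf-16-le").hex())  # 4100 -- same value, bytes swapped
# Which one the host produces depends on the machine, which is why
# byte-at-a-time code must know which scheme it is looking at:
print(sys.byteorder)  # 'little' on x86, 'big' on (for example) PowerPC
```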
I think it is handicapping to think of "wide characters" or "two-byte
systems".
Dar
--
**********************************************
DSC (Dar Scott Consulting & Dar's Lab)
http://www.swcp.com/dsc/
Programming and software
**********************************************