Unicode and chunk expressions

Dar Scott dsc at swcp.com
Tue May 17 18:51:01 EDT 2005


On May 17, 2005, at 3:25 PM, Richard Gaskin wrote:

> Forgive my ignorance, but how can UTF8 be used with two-byte systems 
> like Chinese?  I was under the impression those had to be UTF16.

Unicode is universal in that characters from many languages, language 
families, special use domains are all mapped onto the same numerical 
space.  Unless you need to import or export files in some particular 
encoding format, you don't need specialized encoding methods.

Each Unicode character has a number that fits in 32 bits (the defined 
range actually ends at U+10FFFF).  Almost all the ones you are likely 
to use fit in the lower 16 bits.  The number associated with the 
character (not the glyph) is called its code point.
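
Not Transcript, but here is a small Python sketch of the same idea; 
the characters are just examples:

```python
# A code point is the number assigned to a character, independent of
# how it is later packed into bytes.
for ch in "Aé中":
    print(f"U+{ord(ch):04X}")
# Prints U+0041, U+00E9, U+4E2D -- all well inside the lower 16 bits.
```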

The representation of a sequence of characters as a sequence of 32-bit, 
16-bit or 8-bit values is called an encoding form.  It does not lose 
information; it just packs it.  The encoding form is what you would 
consider when working in the computer.  Those encoding forms are 
UTF-32, UTF-16 and UTF-8.  Note that the byte order within each value 
is not specified.
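
A quick Python sketch (not Transcript) of the three encoding forms 
packing the same character; the string is just an example:

```python
# One character (U+4E2D), three encoding forms -- same information,
# different packing.
s = "中"
print(s.encode("utf-32-le").hex())  # 2d4e0000: one 32-bit value
print(s.encode("utf-16-le").hex())  # 2d4e: one 16-bit value
print(s.encode("utf-8").hex())      # e4b8ad: three bytes
```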

However, byte order does have to be specified if these values are 
viewed as bytes (or Transcript chars) or you are writing to a file.  
There you need UTF-32BE (big endian), UTF-32LE, UTF-16BE and UTF-16LE; 
UTF-8 needs no byte-order variants because its values are single 
bytes.  These are called encoding schemes.
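
A small Python illustration (not Transcript) of encoding schemes, 
that is, forms with the byte order pinned down:

```python
s = "中"  # U+4E2D
# The BE and LE schemes serialize the same 16-bit value in opposite orders.
print(s.encode("utf-16-be").hex())  # 4e2d
print(s.encode("utf-16-le").hex())  # 2d4e
# UTF-8 is byte-oriented, so there is no order to choose.
print(s.encode("utf-8").hex())      # e4b8ad
```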

All Unicode characters can be packed into UTF-8 or UTF-16.

For UTF-16, only the rare (and reasonably ignorable) characters 
outside the BMP, the basic range, need special handling.  Those are 
encoded as surrogate pairs, special double 16-bit values.
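
For the curious, here is what such a double value looks like in 
Python; the G clef character is just an example from outside the BMP:

```python
s = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP
data = s.encode("utf-16-be")
print(data.hex())  # d834dd1e: two 16-bit values, D834 then DD1E
print(len(data))   # 4 bytes for one character
```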

For UTF-8 the encoding is very clever.  All characters in the ASCII 
code range (7 bits) are represented by bytes with the high bit zero.  
All others are represented by a sequence of bytes in which the high 
two bits are 11 for the first byte of the sequence and 10 for all the 
others.  Also, the number of bytes for that character can be 
determined from the first byte alone.  You can read this backwards, 
too, so if Transcript goes to UTF-8, you can get char -1.
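
The backwards scan can be sketched in a few lines of Python (not 
Transcript); the string is just an example:

```python
# Lead bytes start with 0 (ASCII) or 11; continuation bytes start with 10.
# So to find the last character, skip continuation bytes from the end.
data = "ab中".encode("utf-8")
i = len(data) - 1
while data[i] & 0xC0 == 0x80:  # 10xxxxxx means continuation
    i -= 1
print(data[i:].decode("utf-8"))  # prints the last character: 中
```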

Since all the characters outside the ASCII range are represented by 
two to four bytes with the high bit set, you can never get a false lf 
or space or comma.  Also, '=' only considers ASCII letters for case, 
so you never get any false case conversions in comparisons.  "is a 
number" works with the usual Transcript numerals.  UTF-8 has no null 
bytes unless the string contains the null character, so you can use it 
as a key to an array.
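
That delimiter-safety argument is easy to check directly; a Python 
sketch (not Transcript), with made-up sample items:

```python
# Every byte of a multi-byte UTF-8 sequence has its high bit set, so the
# ASCII comma byte can never occur inside another character.
items = ["中文", "日本語", "한국어"]
blob = ",".join(items).encode("utf-8")
parts = [p.decode("utf-8") for p in blob.split(b",")]
print(parts == items)  # True: splitting the raw bytes was safe
```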

There may be ways folks can fool you by putting a combining dot over a 
comma or space (if that is possible), but usually the comma and space 
work just the way you expect.  Oh, I forgot to say that tab and lf are 
part of the ASCII range.

I don't know how the word chunk treats characters with the high bit 
set, but I bet it treats them as just more characters outside of white 
space, so those should work in words, even if some of them are special 
codes for special spaces.

I would expect the compiler works the same way, so a special editor 
could compile Unicode string constants into UTF-8.

UTF-8 is a "language" in uniDecode and uniEncode, so you can convert 
easily.

Note that when I mention UTF-16, the normal form we get from "the 
unicodeText", I always emphasize "host-order", though that is redundant 
in a sense.  The order depends on the OS.  Because we can access it 
one byte at a time, we have to know that on one platform it is 
UTF-16BE and on another it might be UTF-16LE.
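
Python exposes the host order directly, which makes the point easy to 
see; a small sketch (not Transcript):

```python
import sys

# Serializing UTF-16 without naming BE or LE falls back to the host's
# native byte order; Python marks the result with a byte order mark.
data = "A".encode("utf-16")  # native order, BOM first
expected_bom = b"\xff\xfe" if sys.byteorder == "little" else b"\xfe\xff"
print(sys.byteorder, data[:2] == expected_bom)
```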

I think it is handicapping to think of "wide characters" or "two-byte 
systems".

Dar

-- 
**********************************************
     DSC (Dar Scott Consulting & Dar's Lab)
     http://www.swcp.com/dsc/
     Programming and software
**********************************************



More information about the use-livecode mailing list