Unicode and chunk expressions

Richard Gaskin ambassador at fourthworld.com
Tue May 17 18:58:25 EDT 2005


Dar Scott wrote:
> 
> On May 17, 2005, at 3:25 PM, Richard Gaskin wrote:
> 
>> Forgive my ignorance, but how can UTF8 be used with two-byte systems 
>> like Chinese?  I was under the impression those had to be UTF16.
> 
> 
> Unicode is universal in that characters from many languages, language 
> families, special use domains are all mapped onto the same numerical 
> space.  Unless you need to import or export files in some particular 
> encoding format, you don't need specialized encoding methods.
> 
> Each Unicode character fits in 32 bits (code points only go up to 
> 10FFFF hex).  Almost all the ones you are likely to use are in the 
> lower 16.  The number associated with the character (not the glyph) is 
> the code point.
> 
> The representation of a sequence of characters in a sequence of 32-bit, 
> 16-bit or 8-bit values is called an encoding form.  It does not lose 
> information.  It just packs it.  The encoding form is what you would 
> consider when working in the computer.  Those encoding forms are UTF-32, 
> UTF-16 and UTF-8.  Note that the byte order within each value is not 
> specified.
> 
> However, those byte-orders have to be specified if these are viewed as 
> bytes (or Transcript chars) or you are writing to a file.  There you 
> need UTF-32BE (big endian), UTF-32LE, UTF-16BE, UTF-16LE and UTF-8.  The 
> order is not needed on UTF-8.  These are called encoding schemes.
> 
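As a rough illustration of the encoding schemes described above, here is a 
small Python sketch (Python rather than Transcript, only because its codec 
names map directly onto the scheme names).  The character U+4E2D is just an 
example I picked; it is not from the original discussion.

```python
# Encode one code point (U+4E2D, a common Chinese character) under each
# encoding scheme and show the resulting bytes.
ch = "\u4e2d"
schemes = ["utf-32-be", "utf-32-le", "utf-16-be", "utf-16-le", "utf-8"]
encoded = {name: ch.encode(name) for name in schemes}

for name, data in encoded.items():
    print(f"{name:10} {data.hex(' ')}")

# Every scheme decodes back to the same character, so nothing is lost.
assert all(data.decode(name) == ch for name, data in encoded.items())
```

The BE and LE variants hold the same values with the bytes swapped, while 
UTF-8 has only one byte order.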
> Every Unicode character can be packed into either UTF-8 or UTF-16.
> 
> For UTF-16, the only complication is the rare (and reasonably ignored) 
> characters outside the BMP, the Basic Multilingual Plane.  Those are 
> handled by special pairs of 16-bit values (surrogate pairs).
> 
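A minimal Python sketch of the "double values" for characters outside the 
BMP.  The helper name utf16_units and the two sample characters are my own 
choices for illustration.

```python
import struct

def utf16_units(ch):
    """Return the 16-bit values UTF-16 uses for the given text."""
    data = ch.encode("utf-16-be")
    return list(struct.unpack(f">{len(data) // 2}H", data))

bmp = utf16_units("\u4e2d")          # inside the BMP: one 16-bit value
beyond = utf16_units("\U0001d11e")   # U+1D11E, outside the BMP: two values

print([hex(u) for u in bmp])
print([hex(u) for u in beyond])
```

The two values in the second case both fall in the D800 to DFFF range, 
which is reserved for exactly this purpose.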
> For UTF-8 the encoding is very clever.  All characters in the ASCII code 
> range (7 bits) are represented by bytes with the high bit zero.  All 
> others are represented by a sequence of bytes of which the high two bits 
> are 11 for the first byte of the sequence and all the others are 10.  
> Also it is possible to determine the number of bytes for that character 
> from the first byte.  You can read this backwards, too, so if Transcript 
> goes to UTF-8, you can get char -1.
> 
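The bit patterns above can be sketched in a few lines of Python.  The 
function names sequence_length and last_char are hypothetical, and the 
sketch assumes well-formed UTF-8 input; it does no validation beyond the 
lead byte.

```python
def sequence_length(lead):
    """Number of bytes in a UTF-8 sequence, judged from its first byte."""
    if lead < 0x80:
        return 1                    # 0xxxxxxx: ASCII
    if (lead >> 5) == 0b110:
        return 2                    # 110xxxxx
    if (lead >> 4) == 0b1110:
        return 3                    # 1110xxxx
    if (lead >> 3) == 0b11110:
        return 4                    # 11110xxx
    raise ValueError("not the first byte of a sequence")

def last_char(data):
    """Return the final character by stepping back over 10xxxxxx bytes."""
    i = len(data) - 1
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return data[i:].decode("utf-8")

print(sequence_length("\u4e2d".encode("utf-8")[0]))
print(last_char("abc\u4e2d".encode("utf-8")))
```

Reading backwards works because a continuation byte (10xxxxxx) can never 
be mistaken for the start of a character.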
> Since all the characters outside the ASCII range are represented by two 
> to four bytes with the high bit set, you can never get a false lf or 
> space or comma.  Also, '=' only considers ASCII letters in case, so you 
> never get any false letter-case conversions for comparison.  "is a 
> number" works with the usual Transcript numerals.  UTF-8 has no nulls 
> if there is no null character, so you can use it as a key to an array.
> 
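To make the "no false comma" point concrete, here is a hedged Python 
sketch that splits UTF-8 at the byte level, roughly what a byte-oriented 
item chunk would do.  The sample string is invented for the example.

```python
text = "\u4e2d\u6587,na\u00efve,\u65e5\u672c\u8a9e\tend"
data = text.encode("utf-8")

# Every byte belonging to a non-ASCII character has the high bit set.
non_ascii_bytes = [b for ch in text if ord(ch) > 127
                   for b in ch.encode("utf-8")]
assert all(b >= 0x80 for b in non_ascii_bytes)

# So splitting the raw bytes on the comma byte yields the same items
# as splitting the decoded text on the comma character.
items = [part.decode("utf-8") for part in data.split(b",")]
print(items)

# By contrast, UTF-16 bytes can contain a false comma: U+022C encodes
# as the bytes 02 2C, and 2C is the ASCII comma.
false_comma = b"," in "\u022c".encode("utf-16-be")
print(false_comma)
```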
> There may be ways folks will fool you by putting a dot over a comma or 
> space (if possible), but usually the comma and space work just the way 
> you expect.  Oh, I forgot to say that tab and lf are part of the ASCII 
> range.
> 
> I don't know how the word chunk treats characters with the high bit 
> set, but I bet it treats those as just more characters outside of white 
> space, so those should work in words, even if some of them are codes 
> for special spaces.
> 
> I would expect the compiler is the same way, so a special editor can 
> compile unicode string constants into UTF-8.
> 
> UTF-8 is a "language" in uniDecode and uniEncode, so you can convert 
> easily.
> 
> Note that when I mention UTF-16, the normal form we get from "the 
> unicodeText", I always emphasize "host-order", though that is redundant 
> in a sense.  The order depends on the OS.  Because we can access those 
> one byte at a time, we must then know that one is UTF-16BE and another 
> might be UTF-16LE.
> 
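The host-order point can be checked in Python, which is only an analogy 
here: I am assuming nothing about how "the unicodeText" is implemented, 
just showing what "host order" means for UTF-16 bytes.

```python
import codecs
import sys

# Pick the explicit scheme that matches this machine's byte order.
native = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"

# Python's plain "utf-16" codec writes a byte order mark followed by
# the text in host order.
marked = "A".encode("utf-16")
bom, body = marked[:2], marked[2:]

print(sys.byteorder, native, marked.hex(" "))
```

Seen one byte at a time, the same text is 41 00 on a little-endian host 
and 00 41 on a big-endian one, which is why the order has to be known.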
> I think it is handicapping to think of "wide characters" or "two-byte 
> systems".
> 
> Dar
> 

Damn fine post, Dar.  Thanks for that background.

Have you by chance made a nifty tutorial on Unicode like the ultra-cool 
one you did about messages?

-- 
  Richard Gaskin
  Fourth World Media Corporation
  ___________________________________________________________
  Ambassador at FourthWorld.com       http://www.FourthWorld.com

