why one char in UTF8 (3 bytes) converted to UTF16 becomes 6 bytes?

Jan Schenkel janschenkel at yahoo.com
Wed Mar 30 16:36:25 EDT 2011


--- On Wed, 3/30/11, Kee Nethery <kee at kagi.com> wrote:
> > 
> > Ideally, all the conversion would take place at the end-points:
> > open file <theFilePath> for text read with encoding <theEncoding>
> > open file <theFilePath> for text write with encoding <theEncoding>
> > put <theVariable> into URL <theUrl> with encoding <theEncoding>
> > put URL <theUrl> into <theVariable> with encoding <theEncoding>
> > 
> > Internally, the engine would handle everything in UTF-16,
> 
> :-) UTF-16 BE or UTF-16 LE?
> 

I don't care, as long as my app can read and write both, regardless of the platform it is running on ;-)
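
For what it's worth, Java already behaves this way. Here's a minimal sketch (the class name ByteOrderDemo and the helper decodeUtf16 are just for illustration): the generic "UTF-16" charset looks at the byte-order mark when decoding, so the same call reads both BE and LE data, while the explicit UTF-16BE and UTF-16LE charsets let you pick the order when writing. It also prints the byte counts for a single character, the euro sign, which takes 3 bytes in UTF-8.

```java
import java.nio.charset.StandardCharsets;

public class ByteOrderDemo {
    // Decode UTF-16 data; the byte-order mark (if present) selects BE or LE.
    static String decodeUtf16(byte[] data) {
        return new String(data, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        String euro = "\u20AC"; // EURO SIGN, a single char

        byte[] bigEndian    = {(byte) 0xFE, (byte) 0xFF, (byte) 0x20, (byte) 0xAC};
        byte[] littleEndian = {(byte) 0xFF, (byte) 0xFE, (byte) 0xAC, (byte) 0x20};
        System.out.println(decodeUtf16(bigEndian).equals(euro));    // true
        System.out.println(decodeUtf16(littleEndian).equals(euro)); // true

        System.out.println(euro.getBytes(StandardCharsets.UTF_8).length);    // 3
        System.out.println(euro.getBytes(StandardCharsets.UTF_16BE).length); // 2 (no BOM)
        System.out.println(euro.getBytes(StandardCharsets.UTF_16LE).length); // 2 (no BOM)
        System.out.println(euro.getBytes(StandardCharsets.UTF_16).length);   // 4 (BOM + 2)
    }
}
```

So the only thing that changes between BE and LE is the order of the two bytes, plus an optional 2-byte mark at the very start of the data.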

> and the with encoding, that would be what you want or what
> you have?
> 

The encoding would be whatever the incoming stuff is for reading, and whatever the outgoing stuff should be for writing. In Java, every string uses UTF-16 internally, and conversion is handled via InputStreamReaders and OutputStreamWriters.

Here's an example of using an InputStreamReader:
##
FileInputStream fis = new FileInputStream("input.txt");
InputStreamReader isr = new InputStreamReader(fis, "UTF8");
##
which means that whatever you're reading from the file "input.txt" should be interpreted as UTF-8.

Likewise, here's an example of using an OutputStreamWriter:
##
FileOutputStream fos = new FileOutputStream("output.txt");
Writer out = new OutputStreamWriter(fos, "ISO-8859-1");
##
which means that whatever you're writing to the file "output.txt" should end up on the hard drive as ISO-8859-1.
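
Putting the two halves together, a complete round trip could look roughly like the sketch below. The class name TranscodeDemo, the helper transcode and the temporary files are my own choices for the illustration; only the reader and writer classes come from the snippets above.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.file.Files;

public class TranscodeDemo {
    // Copy text between files, decoding on the way in and encoding on the way out.
    static void transcode(File in, String inCharset, File out, String outCharset)
            throws IOException {
        try (Reader reader = new InputStreamReader(new FileInputStream(in), inCharset);
             Writer writer = new OutputStreamWriter(new FileOutputStream(out), outCharset)) {
            char[] buffer = new char[4096]; // UTF-16 code units in between
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        File in = File.createTempFile("input", ".txt");
        File out = File.createTempFile("output", ".txt");
        Files.write(in.toPath(), new byte[] {(byte) 0xC3, (byte) 0xA9}); // e-acute as UTF-8
        transcode(in, "UTF-8", out, "ISO-8859-1");
        System.out.println(Files.readAllBytes(out.toPath()).length); // 1 (the byte 0xE9)
    }
}
```

The code in the middle never sees an encoding; it only handles chars, which is the whole point of converting at the end-points.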

And if you don't specify the charset name, Java will apply a platform-specific default, which you can override with a startup parameter.
LiveCode does something similar (think of how it automatically interprets CR/LF depending on the platform) unless you read the file as binary and then do the decoding yourself.
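
To make the Java half of that concrete, here's a small sketch (DefaultCharsetDemo and its two helpers are illustrative names). It prints the default the JVM picked and contrasts decoding with and without a named charset. The startup parameter I have in mind is the file.encoding system property (java -Dfile.encoding=UTF-8 ...); treat that as an assumption, since how far a JVM honours it can differ between versions.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetDemo {
    // Charset named explicitly: same result on every platform.
    static String decodeExplicit(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    // No charset named: the platform default is applied.
    static String decodeDefault(byte[] data) {
        return new String(data);
    }

    public static void main(String[] args) {
        byte[] eAcute = {(byte) 0xC3, (byte) 0xA9}; // e-acute encoded as UTF-8
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println(decodeExplicit(eAcute).length()); // 1
        System.out.println(decodeDefault(eAcute).length());  // depends on the default
    }
}
```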

Anyway, in my earlier examples of how LiveCode might do it:
##
open file "input.txt" for text read with encoding "UTF-8"
##
means that the engine should interpret whatever is in the file as UTF-8 encoded.

> If I was king of LiveCode Unicode, I'd make all text UTF8.
> I'd have char refer to the characters regardless how many
> bytes are required to encode it. Bytes would give you the
> actual encoding values. I'd have a convert function to
> change text (or binary if it came in as something other than
> utf8) into something else and I'd require the from encoding
> and the to encoding to be specified. The "with encoding"
> assumes I know which side of the encoding is assumed to be
> something (the from or the to) and I don't. I'm a big fan of
> explicit rather than assumed in the code I write.
> 

That's why they introduced the 'byte' chunk type in version 3.0: in preparation for a time when a char can be more than one byte, and we won't have to know or care because the engine does the right thing.
Making everything UTF-8 means you'll typically have a harder time figuring out chunk byte ranges, as you have to check each and every byte to know whether the char is actually 1, 2, 3 or 4 bytes long.
If you use UTF-16 instead, you'll eat more memory if your data stays in the ASCII range, but most character sets will fit happily into two bytes. And for the characters that do require 4 bytes instead of 2, you only need to check every other byte.
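
Here's a rough sketch of the difference in Java, assuming well-formed input (ChunkScanDemo and both counting helpers are made up for this mail, not part of any engine). The UTF-8 version has to look at every byte and skip the continuation bytes; the UTF-16 version does one test per 16-bit unit and only treats a unit specially when it is the first half of a 4-byte pair.

```java
import java.nio.charset.StandardCharsets;

public class ChunkScanDemo {
    // UTF-8: inspect every byte; continuation bytes (10xxxxxx) do not start a char.
    static int countCharsUtf8(byte[] utf8) {
        int count = 0;
        for (byte b : utf8) {
            if ((b & 0xC0) != 0x80) {
                count++;
            }
        }
        return count;
    }

    // UTF-16: one test per 16-bit unit; a high surrogate announces a 4-byte char.
    static int countCharsUtf16(char[] units) {
        int count = 0;
        for (int i = 0; i < units.length; i++) {
            if (Character.isHighSurrogate(units[i]) && i + 1 < units.length
                    && Character.isLowSurrogate(units[i + 1])) {
                i++; // skip the second half of the pair
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 'a' (1 byte in UTF-8), e-acute (2), euro sign (3), U+1D11E musical G clef (4)
        String text = "a\u00E9\u20AC\uD834\uDD1E";
        System.out.println(countCharsUtf8(text.getBytes(StandardCharsets.UTF_8))); // 4
        System.out.println(countCharsUtf16(text.toCharArray()));                   // 4
        System.out.println(text.length());                                         // 5 UTF-16 units
    }
}
```

Either way the engine has to walk the data to find the Nth char; UTF-16 just takes fewer steps and hits the special case far less often.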

Jan Schenkel.
=====
Quartam Reports & PDF Library for LiveCode
www.quartam.com

=====
"As we grow older, we grow both wiser and more foolish at the same time."  (La Rochefoucauld)