why does one char in UTF-8 (3 bytes) become 6 bytes when converted to UTF-16?
kee at kagi.com
Wed Mar 30 15:33:52 EDT 2011
> Ideally, all the conversion would take place at the end-points:
> open file <theFilePath> for text read with encoding <theEncoding>
> open file <theFilePath> for text write with encoding <theEncoding>
> put <theVariable> into URL <theUrl> with encoding <theEncoding>
> put URL <theUrl> into <theVariable> with encoding <theEncoding>
> Internally, the engine would handle everything in UTF-16,
:-) UTF-16 BE or UTF-16 LE?
And for the "with encoding" part: is that what you want, or what you have?
If I were king of LiveCode Unicode, I'd make all text UTF-8. I'd have "char" refer to characters regardless of how many bytes are required to encode them, while "byte" would give you the actual encoded values. I'd have a convert function to change text (or binary, if it came in as something other than UTF-8) into something else, and I'd require both the from encoding and the to encoding to be specified. A "with encoding" assumes I know which side of the conversion is being specified (the from or the to), and I don't. I'm a big fan of explicit rather than assumed in the code I write.
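The explicit from/to conversion proposed above can be sketched in Python (a sketch only; the name "convert" is the one proposed in this post, not an existing LiveCode function):

```python
def convert(data: bytes, from_enc: str, to_enc: str) -> bytes:
    """Re-encode text, requiring BOTH encodings to be named explicitly.

    Nothing is assumed: the caller must say what they have (from_enc)
    and what they want (to_enc).
    """
    return data.decode(from_enc).encode(to_enc)

# A 3-byte UTF-8 character is still one character; only its byte length changes:
utf8_bytes = "\u20ac".encode("utf-8")                    # euro sign: 3 bytes, b'\xe2\x82\xac'
utf16_bytes = convert(utf8_bytes, "utf-8", "utf-16-le")  # 2 bytes, b'\xac\x20'
```

Note that "char 1 of" the UTF-8 bytes would be meaningless here; only after decoding do you have characters to count.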
> or whatever is most appropriate and efficient; but reading and writing data to and from files, databases, etc. in another encoding should be as transparent as possible.
From what I've seen, most platform-independent text is UTF-8, because it can typically be read by any text editor and it avoids the Byte Order Mark (BOM) nonsense.
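The BOM point (and the earlier BE-versus-LE question) can be illustrated with a Python sketch; the byte values shown are what CPython's codecs produce:

```python
# UTF-16 bytes on disk are ambiguous between big- and little-endian,
# so the generic "utf-16" codec prepends a Byte Order Mark (U+FEFF).
with_bom = "hi".encode("utf-16")     # b'\xff\xfeh\x00i\x00' (LE, BOM first)
be = "hi".encode("utf-16-be")        # b'\x00h\x00i' (order fixed, no BOM)
le = "hi".encode("utf-16-le")        # b'h\x00i\x00'

# UTF-8 has no byte-order ambiguity, so no BOM is needed:
plain = "hi".encode("utf-8")         # b'hi'
```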
But ... I agree with the previous response: gots to deal with it the way it is.
Still pondering how lines of data work when the text is UTF-16.
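On the subject-line question itself: a 3-byte UTF-8 character should become 2 bytes in UTF-16, not 6. Six bytes is the classic double-encoding mistake, where each UTF-8 byte is treated as a character of its own. A Python sketch of both paths (assuming that is indeed what happened):

```python
ch = "\u4e2d"  # one CJK character, U+4E2D
utf8 = ch.encode("utf-8")          # 3 bytes: b'\xe4\xb8\xad'
correct = ch.encode("utf-16-le")   # 2 bytes: the right conversion

# The bug: treat the 3 UTF-8 bytes as 3 separate Latin-1 "characters",
# then encode each of those to UTF-16 at 2 bytes apiece:
doubled = utf8.decode("latin-1").encode("utf-16-le")  # 6 bytes
```

So the 6-byte result suggests the engine converted the raw bytes to UTF-16 without first decoding them as UTF-8.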