Unicode sorting
Dar Scott
dsc at swcp.com
Fri Jun 2 16:12:39 EDT 2006
On Jun 2, 2006, at 9:45 AM, Devin Asay wrote:
> replace "Ж" with "ж" in lList
I didn't know you could do that with the current editor. I had been
suggesting a way to do that kind of thing using UTF-8 and was hoping
a script editor publisher would pick up on it.
However, the 2.7.1 editor uses host order UTF-16, which is pretty
silly since you can end up with problems like this:
> replace """e with "т" in lList --U.C. Russ T has #0022 as
> byte 2 (= ascii quote char)
And that solution isn't quite right and isn't close on other platforms.
Not only that, but strings like "Ж is zhe" are garbled. Who knows
what happens with characters in the high range of the rev traditional
host character encoding?
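
A small sketch of that hidden quote, in Python rather than Transcript,
only to show the bytes involved. It assumes the "Russ T" in the quoted
line is uppercase Cyrillic TE (U+0422); the variable names are made up
for illustration.

```python
# Assumption: the troublesome character is Cyrillic capital TE, U+0422.
upper_te = "\u0422"
quote = b'"'  # ASCII quote, byte 0x22

# Host-order UTF-16 means the byte order depends on the platform.
be = upper_te.encode("utf-16-be")  # big-endian host:    04 22
le = upper_te.encode("utf-16-le")  # little-endian host: 22 04

# Either way, one of the two bytes is a bare ASCII quote.
has_quote_be = quote in be
has_quote_le = quote in le

# The same character in UTF-8 is D0 A2: both bytes have the high bit set.
utf8 = upper_te.encode("utf-8")
has_quote_utf8 = quote in utf8
```

The quote byte lands in a different position depending on byte order,
which would explain why a workaround tuned for one platform is not
close on another.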
The right way to do this until we get full Unicode is to make this
UTF-8. The bad news is that some folks might already be using this,
assuming Unicode, and, where it does not work, adding lots of ad
hoc fixes.
UTF-8!
Why? There are no hidden ASCII chars in UTF-8. I mean 7-bit true
ASCII. If it looks like an ASCII char, it is. All non-ASCII chars
are represented by a sequence of bytes with the high bit set. With a
few minor exceptions that can be taken care of (>= single char,
format(), etc.), this means that UTF-8 with Unicode in comments and
quoted literals will parse OK. There might be a surprise, of course.
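
The "no hidden ASCII" property can be checked mechanically. This is a
sketch in Python, not Transcript, and the sample script line is invented
for the purpose.

```python
# Hypothetical script line with Unicode in a quoted literal and a comment.
sample = 'put "Ж is zhe" into x -- комментарий'
encoded = sample.encode("utf-8")

# Every byte below 0x80 in the UTF-8 form...
ascii_from_bytes = bytes(b for b in encoded if b < 0x80)
# ...should be exactly the real ASCII characters of the text, in order.
ascii_from_chars = "".join(c for c in sample if ord(c) < 128).encode("ascii")
no_hidden_ascii = ascii_from_bytes == ascii_from_chars

# And every byte that encodes a non-ASCII character has the high bit set.
non_ascii_high = all(
    b >= 0x80
    for c in sample if ord(c) >= 128
    for b in c.encode("utf-8")
)
```

A parser that only looks for ASCII quotes, comment markers and line ends
therefore sees the same delimiters in the bytes as a reader sees in the
text.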
This is also why item and line parsing works fine with UTF-8. There
are no hidden commas and line ends.
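
The same point for items and lines, again as a Python sketch with
made-up data: splitting the raw UTF-8 bytes on comma and line feed gives
the same pieces as splitting the text. The two UTF-16 characters used for
contrast are my own examples, not from the thread.

```python
text = "Ж,ж,Т\nт,zhe"
data = text.encode("utf-8")

# Split on raw delimiter bytes, without decoding first.
rows = [line.split(b",") for line in data.split(b"\n")]
decoded = [[item.decode("utf-8") for item in row] for row in rows]

# Contrast with UTF-16: Cyrillic capital soft sign (U+042C) carries a
# comma byte, and U+040A carries a line feed byte.
hidden_comma = b"," in "\u042c".encode("utf-16-be")
hidden_newline = b"\n" in "\u040a".encode("utf-16-be")
```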
Dar