Unicode sorting

Dar Scott dsc at swcp.com
Fri Jun 2 16:12:39 EDT 2006


On Jun 2, 2006, at 9:45 AM, Devin Asay wrote:

>   replace "Ж" with "ж" in lList

I didn't know you could do that with the current editor.  I had been
suggesting a way to do that kind of thing using UTF-8 and was hoping
a script editor publisher would pick up on it.

However, the 2.7.1 editor uses host order UTF-16, which is pretty  
silly since you can end up with problems like this:

>   replace ""&quote with "т" in lList --U.C. Russ T has #0022 as  
> byte 2 (= ascii quote char)

And that workaround isn't quite right here, and it isn't even close on
other platforms.
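
The hidden quote is easy to see if you look at the raw bytes.  Here is
a rough sketch in Python (not Rev script; I'm only using it to poke at
bytes).  It assumes big-endian host order, which is my guess from
"byte 2" in the quoted comment; on a little-endian host the same two
bytes show up swapped.

```python
# Capital Cyrillic T (U+0422), the character from the quoted example.
te = "\u0422"

# Assumption: big-endian host order, so the low byte comes second.
utf16_be = te.encode("utf-16-be")
utf8 = te.encode("utf-8")

QUOTE = 0x22  # the ASCII double-quote character

# The UTF-16 form contains a byte that looks exactly like a quote.
print(utf16_be.hex(), QUOTE in utf16_be)

# The UTF-8 form is two bytes, both with the high bit set.
print(utf8.hex(), QUOTE in utf8)
```

Any tool that scans bytes for a closing quote will stop in the middle
of the UTF-16 character but not the UTF-8 one.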

Not only that, but strings like "Ж is zhe" are garbled.  Who knows
what happens with characters in the high range of Rev's traditional
host character encoding?

The right way to do this until we get full Unicode is to make this
UTF-8.  The bad news is that some folks might already be using this,
assuming Unicode, and adding lots of ad hoc fixes where it does not
work.

UTF-8!

Why?  There are no hidden ASCII chars in UTF-8.  I mean 7-bit true
ASCII.  If it looks like an ASCII char, it is.  All non-ASCII chars
are represented by a sequence of bytes with the high bit set.  With a
few minor exceptions that can be taken care of (>= single char,
format(), etc.), this means that UTF-8 with Unicode in comments and
quoted literals will parse OK.  There might be a surprise, of course.
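
To check the "if it looks like an ASCII char, it is" claim, here is a
small Python sketch.  The helper name ascii_bytes_are_real is made up
for this illustration, and the sample line is the script line quoted
at the top of this thread.

```python
def ascii_bytes_are_real(text: str) -> bool:
    """True if every byte below 0x80 in the UTF-8 form of text
    comes from a genuine ASCII character in text, in the same order."""
    encoded = text.encode("utf-8")
    ascii_in_bytes = [b for b in encoded if b < 0x80]
    ascii_in_text = [ord(c) for c in text if ord(c) < 0x80]
    return ascii_in_bytes == ascii_in_text


sample = 'replace "Ж" with "ж" in lList'

# The quotes, spaces and letters a byte-oriented parser sees are
# exactly the ones that were typed.
print(ascii_bytes_are_real(sample))

# The two Cyrillic letters contribute only high-bit bytes.
print(all(b >= 0x80 for b in "Жж".encode("utf-8")))
```

So a parser that only understands single-byte text sees the Unicode
characters as opaque runs of high-bit bytes between the real quotes.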

This is also why item and line parsing works fine with UTF-8.  There  
are no hidden commas and line ends.
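
The same point for items and lines, again as a Python sketch rather
than Rev script.  The sample text is invented, and the UTF-16 half
assumes big-endian byte order.  I picked Ь (U+042C) and Њ (U+040A)
because their low bytes are the comma and line feed codes.

```python
text = "Ж is zhe,Ь is soft sign\nЊ is nje,т is te"

# Byte-level line and item splitting on UTF-8 gives back what was typed.
data = text.encode("utf-8")
rows = [line.split(b",") for line in data.split(b"\n")]
decoded = [[item.decode("utf-8") for item in row] for row in rows]
print(decoded)

# The same byte-level split on UTF-16 (big-endian assumed) finds extra
# separators hidden inside Ь and Њ.
wide = text.encode("utf-16-be")
print(len(wide.split(b"\n")), "pieces instead of 2 lines")
print(len(wide.split(b",")), "pieces instead of 3 comma-separated chunks")
```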

Dar





