Unicode sorting

Devin Asay devin_asay at byu.edu
Sat May 27 11:12:00 EDT 2006


On May 26, 2006, at 5:26 PM, Dar Scott wrote:

>
> On May 26, 2006, at 3:57 PM, Devin Asay wrote:
>
>> A 'sort lines' command, after converting upper case to lower,  
>> works fairly well, except that, curiously, a space sorts *after*  
>> all cyrillic chars.
>
>
> I think I figured out what it is.  'sort' seems to see NUL as the  
> end of the string and U+0020 has virtually a NUL in it.  Try this  
> test:
>
> on mouseUp
>   put "a" & NULL & "z" & lf & "a" & NULL & "b" into d
>   sort d
>   replace NULL with "x" in d
>   put d
> end mouseUp
> ==>
> axz
> axb
>
> We have been bitten by C again.

Hmmm. Maybe this is it. But I thought the reason was different: We
know (I think) that when we enter unicode text in Rev, characters in
the lower ascii range are still encoded as lower ascii. So space and
cr are still ascii 32 (#20) and ascii 10 (#0A). (I think RunRev do
this so that chunk expressions like word and line still work for
unicode text.) To see what I mean, type some unicode-range text into a
field, then look at the htmlText.

The sort command is an ascii sort by default, so when Rev sorts a
field it sees all the characters, ascii or unicode, as ascii bytes,
much like what you get if you have a field with unicode text in it
and do:

   put fld "myUnicodeText" into fld "anyOldField"

When your unicode is in a low unicode range, like my Cyrillic
example, all of the first bytes are low values (#04 for Cyrillic).
Here is what I think Rev is seeing.

For the Russian (don't know if this will come thru in your email
reader):
   Я вижу вас.
The unicode is (omitting the "U+" convention):
   042F 0020 0432 0438 0436 0443 0020 0432 0430 0441 002E

But what Rev is seeing during sort is a series of single-byte chars,
with the leading null bytes of basic latin range chars dropped:
   04 2F 20 04 32 04 38 04 36 04 43 20 04 32 04 30 04 41 2E
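
To make that byte view concrete, here is a rough sketch in Python (not Rev). It assumes my guess above is right: each char is stored as two bytes, high byte first, and the #00 high byte of basic latin chars is dropped. The function names are made up for illustration.

```python
def bytes_seen_by_sort(text):
    """Model the bytes an ascii sort would see for `text`.

    Assumption (not confirmed Rev behavior): each char is two bytes,
    high byte first, and a 00 high byte is dropped.
    """
    out = bytearray()
    for ch in text:
        code = ord(ch)
        if code > 0xFFFF:
            raise ValueError("sketch only handles chars up to U+FFFF")
        high, low = code >> 8, code & 0xFF
        if high != 0:
            out.append(high)  # e.g. 04 for the Cyrillic range
        out.append(low)
    return bytes(out)


def as_hex(data):
    """Format bytes as space-separated two-digit hex."""
    return " ".join(f"{b:02X}" for b in data)


print(as_hex(bytes_seen_by_sort("Я вижу вас.")))
# prints: 04 2F 20 04 32 04 38 04 36 04 43 20 04 32 04 30 04 41 2E
```

The output matches the byte series listed above.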

Since the first byte of every char in the Cyrillic range is #04
(< #20), Cyrillic chars always sort *before* virtually everything in
the lower ascii range, including space.
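
A small Python sketch of the ordering this would produce, again under my assumed byte model (two bytes per char, #00 high bytes dropped). It is not how Rev actually implements sort, and byte_key is a hypothetical helper name.

```python
def byte_key(line):
    """Sort key: the bytes an ascii sort would compare, under the
    assumed model (high byte first, 00 high bytes dropped)."""
    out = bytearray()
    for ch in line:
        code = ord(ch)
        if code > 0xFFFF:
            raise ValueError("sketch only handles chars up to U+FFFF")
        if code >> 8:
            out.append(code >> 8)
        out.append(code & 0xFF)
    return bytes(out)


lines = ["вы", " я", "abc", "я"]
print(sorted(lines, key=byte_key))
# prints: ['вы', 'я', ' я', 'abc']
```

The line starting with a space lands after both Cyrillic lines, which is the behavior I reported.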

At least I think this is what's going on, and if so, it explains a lot
of the weirdness that happens when working with unicode in Rev.

Anyway, I think the moral of the story is that we need beefed up  
unicode support in Rev, including a true sort unicode option. I've  
submitted an enhancement request for this, BZ 3646. Please consider  
voting for it if you have some votes to spare.
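
For comparison, here is a minimal Python sketch of the simplest thing a unicode-aware sort could do: compare whole code points instead of bytes, after lowercasing. This is only an illustration of the idea, not the design proposed in the enhancement request, and real language-aware collation would need more than this. code_point_sort is a made-up name.

```python
def code_point_sort(lines):
    """Sort lines by code point after lowercasing.

    Python compares strings code point by code point, so U+0020 (space)
    sorts before any Cyrillic char (U+04xx).
    """
    return sorted(lines, key=str.lower)


print(code_point_sort(["Я", "вы", " я", "abc"]))
# prints: [' я', 'abc', 'вы', 'Я']
```

Here the line starting with a space sorts first, as you would expect.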

Devin

Devin Asay
Humanities Technology and Research Support Center
Brigham Young University



