Unicode sorting

Devin Asay devin_asay at byu.edu
Tue May 30 10:48:55 EDT 2006


Dar,
Thanks for the code! I'll test it at the earliest opportunity.
Devin

On May 27, 2006, at 2:05 PM, Dar Scott wrote:

>
> On May 27, 2006, at 9:12 AM, Devin Asay wrote:
>
>>
>> For the Russian (don't know if this will come thru in your email  
>> reader):
>>   Я вижу вас.
>> The unicode is (omitting the "U+" convention):
>>   042F 0020 0432 0438 0436 0443 0020 0432 0430 0441 002E
>>
>> But what rev is seeing during sort is a series of single byte  
>> chars, with leading null bytes of basic latin range chars ignored:
>>   04 2F 20 04 32 04 38 04 36 04 43 20 04 32 04 30 04 41 2E
>>
>> Since all of the first bytes of the Cyrillic range are 04 (< 20),  
>> they are always sorted *before* virtually everything in lower  
>> ascii range.
>
> The letters came through.
>
> But the NUL characters are not dropped unless you are doing  
> something to drop them.  But if they are dropped, then that will  
> happen.

My assumption is that the sort command truncates or ignores them,  
just as the conversion to htmlText does. (I may be completely wrong  
about this, but that's sure what seems to happen.)
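
As a rough illustration of the byte sequences quoted above, here is a small sketch in Python (not Revolution). It assumes the text is stored as big-endian UTF-16, which is what the quoted bytes suggest, and it simulates my guess that the NUL bytes get dropped. Whether the sort command really does that is exactly the open question. The helper name hex_bytes is made up for the example.

```python
# Sketch only: Python, not Revolution. Assumes big-endian UTF-16 storage
# and simulates the (unconfirmed) dropping of NUL bytes.

def hex_bytes(data):
    """Format a bytes object as space-separated two-digit hex values."""
    return " ".join(f"{b:02X}" for b in data)

text = "Я вижу вас."
utf16 = text.encode("utf-16-be")

# Remove every zero byte. In this sample the only zero bytes are the
# high bytes of the space and the period.
without_nuls = utf16.replace(b"\x00", b"")

print(hex_bytes(utf16))
print(hex_bytes(without_nuls))

# An ASCII-range line treated the same way starts with a byte >= 0x20,
# so a bytewise comparison puts the Cyrillic line (leading 0x04) first.
ascii_line = "(abc)".encode("utf-16-be").replace(b"\x00", b"")
print(without_nuls < ascii_line)
```

If the assumption holds, every Cyrillic line begins with the byte 04 and so compares lower than any line beginning with printable ASCII.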

>
> I see you have a period in your characters.  Perhaps you have other  
> characters outside Russian Cyrillic.

In my data set, lines that begin with ASCII-range punctuation, like  
parentheses or periods, all sort to the end of the list. When I grab my  
data from MySQL, where it is saved as UTF-8, using a SELECT ... ORDER  
BY query and order by the Russian words, the ASCII-range lines come out  
sorted to the beginning, as I would expect.
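
The two orderings can be compared side by side with a small Python sketch (again, not Revolution code). Plain UTF-8 byte order is used here only as a stand-in for the database result, since what MySQL actually returns depends on the column's collation. The sample lines and the function names are invented for the example.

```python
# Sketch only: compares two bytewise orderings of the same lines.
# Sample data and function names are hypothetical.

def utf8_key(line):
    """Sort key: the raw UTF-8 bytes of the line."""
    return line.encode("utf-8")

def stripped_utf16_key(line):
    """Sort key: big-endian UTF-16 with zero bytes removed, simulating
    the (unconfirmed) behavior where NUL bytes are ignored."""
    return line.encode("utf-16-be").replace(b"\x00", b"")

lines = ["вас", "(вижу)", "Я", ".точка"]

by_utf8 = sorted(lines, key=utf8_key)
by_stripped_utf16 = sorted(lines, key=stripped_utf16_key)

print(by_utf8)            # punctuation-led lines first
print(by_stripped_utf16)  # punctuation-led lines last
```

In UTF-8 the ASCII characters keep their one-byte values (below 80 hex) and Cyrillic letters start with D0 or D1, so ASCII sorts first. With the NULs stripped from UTF-16, the Cyrillic lead byte 04 sorts ahead of everything printable.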
>
> I forgot about the line ends in sort.  Those can come up in the  
> middle of Unicode characters in general.
>
> I wonder if a sort of the utf8 of the Cyrillic to-lower would be  
> close.  The idea below is probably better in general.
>
> Try something roughly like this (not tested; typed in raw):

<snip>
>
> This will take some debugging.

I'll test it and let you know, then post the debugged code.
>
> In this approach above, one-byte "chars" are used for sorting.  An  
> alternative is to use two ASCII chars, space-char, for the ASCII  
> subset and two letters that sort right for the Cyrillic.  That  
> would make testing easier for russianLex().
>
> I remember from yesterday that yo or ye or something (two dots over  
> e) was not in the basic Russian group, so you will need to handle  
> it separately from the basic Russian range.

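Right, the lowercase 'yo' character (ё) is at U+0451, just after the  
basic uppercase/lowercase range (U+0410 to U+044F), and the uppercase  
one (Ё) is at U+0401, just before it, so it has to be handled as a  
special case.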
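
To make the special case concrete, here is a minimal Python sketch of a per-line sort key that slots ё in right after е. It is not Dar's snipped code and not Revolution syntax. The name russian_lex just echoes the russianLex() mentioned above, and the doubling trick is one hypothetical way to leave a gap for ё. Real Russian collation rules are more involved than this.

```python
# Sketch only: a per-line sort key that places "ё" directly after "е".
# russian_lex is a hypothetical name; this is not the snipped code.

def russian_lex(line):
    """Return a list of integers usable as a sort key for the line.

    Each lowercased code point is doubled, which leaves an odd slot
    free after every character. "ё" (U+0451) takes the slot just
    after "е" (U+0435) instead of its own code point position.
    """
    keys = []
    for ch in line.lower():  # lower() also maps "Ё" (U+0401) to "ё"
        if ch == "ё":
            keys.append(2 * ord("е") + 1)
        else:
            keys.append(2 * ord(ch))
    return keys

words = ["ёж", "яма", "ель", "жук", "Ёлка", "еда"]

naive = sorted(words, key=str.lower)     # plain code point order
fixed = sorted(words, key=russian_lex)   # "ё" follows "е"

print(naive)
print(fixed)
```

In plain code point order the ё words land after я, at the very end. With the adjusted key they fall between the е words and the ж words.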

> BTW, for those not familiar with using customFun(each) in sort,  
> customFun() seems to be called only once for each line; it is not  
> called twice for each comparison.
>
>
> I am not in favor of a Unicode sort option.  I'll elaborate later.   
> I have a couple goals to meet by tonight.

What if it could be toggled on or off by extending the effect of the  
useUnicode property?

Devin

Devin Asay
Humanities Technology and Research Support Center
Brigham Young University



