Unicode sorting

Devin Asay devin_asay at byu.edu
Fri May 26 17:57:23 EDT 2006


Thanks, Dar. These tips will come in handy, and they help confirm some of
the things I was already thinking. A 'sort lines' command, after
converting upper case to lower, works fairly well, except that,
curiously, a space sorts *after* all Cyrillic chars. I'm sure that's
because Rev is really doing an ASCII sort on the Unicode text, and
the first byte of each Unicode character is < #0020. What we really
need is a 'sort ... unicode' option to go along with 'sort ... text' and
'sort ... numeric'.
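
For comparison, here is a rough sketch in Python (not Rev script) of the
kind of ordering such an option might give: lines are compared by Unicode
code point after lowercasing, not by raw bytes. The function name
sort_unicode_lines is made up for illustration, and this is plain
code-point order, not a full language-aware collation.

```python
def sort_unicode_lines(text):
    """Sort the lines of text case-insensitively by Unicode code point."""
    lines = text.split("\n")
    # str.lower() knows the Cyrillic case pairs; Python then compares the
    # lowercased strings one code point at a time.
    return "\n".join(sorted(lines, key=str.lower))


sample = "Яблоко\nна дом\nарбуз\nнадо"
print(sort_unicode_lines(sample))
```

In code-point order a space (U+0020) sorts before every Cyrillic letter,
so "на дом" lands ahead of "надо". One caveat: ё (U+0451) sits after я
(U+044F), so strict Russian alphabet order would still need an extra step.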

Devin

On May 26, 2006, at 12:07 AM, Dar Scott wrote:

>
> On May 25, 2006, at 4:19 PM, Devin Asay wrote:
>
>> I have a need to sort long lists of Cyrillic unicode text  
>> according to Russian alphabet order. Before I start writing my own  
>> routine, has anyone figured out how to sort unicode text lists?
>
> Here are some hints:
>
> 1.
> Trick:  If you are sorting strings with only characters from the  
> same 256 character range, then byte-order doesn't matter when doing  
> a lexical sort.  For example, if all your characters are in the  
> Cyrillic range of U+0400 to U+04FF, then you can use an ordinary  
> byte character sort.  However, if you have spaces (U+0020) then you  
> will need to replace them with something else for sorting or make  
> sure you have control over order.
>
> 2.
> If the high byte of the Unicode characters never looks like a digit  
> then you can compare with < (probably not important if using 'sort').
>
> 3.
> The basic alphabet of a language is typically coded in roughly the  
> order needed for sorting.  That rough order may be just fine for  
> your need.
>
> 4.
> Conversion from lower to upper or upper to lower for sorting is  
> often just a bit-logic operation.  However, since you usually have  
> to do range checking, then adding or subtracting an offset works  
> fine, too.  If you know you have only basic upper and lower  
> letters, doing the bit op every time is probably faster.  This  
> should work for a rough sort.
>
> 5.
> The basic alphabet of a language in Unicode might include  
> characters you don't use.  That is OK as long as the ones you do  
> use are coded in the right order.  The holes don't matter.
>
> Dar
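
Hints 1 and 4 can be sketched together in Python (not Rev script). This
assumes the text is basic Russian letters plus spaces, held as UTF-16
little-endian bytes. The names PLACEHOLDER, fold_to_upper, naive_key and
byte_sort_key are invented for the example, and U+0400 is assumed never
to occur in the data. The case fold uses a range check plus an offset,
because OR-ing in a single bit does not cover the whole А-Я range.

```python
# Assumption: U+0400 never appears in the data. It is the lowest code
# point in the Cyrillic block, so it sorts before every Russian letter.
PLACEHOLDER = "\u0400"


def fold_to_upper(ch):
    """Map а-я (U+0430-U+044F) to А-Я (U+0410-U+042F) by subtracting 0x20.

    Ё/ё (U+0401/U+0451) lie outside this range and are left untouched.
    """
    cp = ord(ch)
    if 0x0430 <= cp <= 0x044F:
        return chr(cp - 0x20)
    return ch


def naive_key(line):
    """Case-fold, then use the raw UTF-16LE bytes as the sort key."""
    folded = "".join(fold_to_upper(ch) for ch in line)
    return folded.encode("utf-16-le")


def byte_sort_key(line):
    """Like naive_key, but move spaces into the Cyrillic block first."""
    folded = "".join(fold_to_upper(ch) for ch in line)
    return folded.replace(" ", PLACEHOLDER).encode("utf-16-le")


lines = ["давно", "Да нет", "арбуз"]
print(sorted(lines, key=naive_key))
print(sorted(lines, key=byte_sort_key))
```

With this data the plain byte sort puts "давно" ahead of "Да нет",
because the low byte of В (0x12) is smaller than the space byte 0x20.
With the placeholder every character shares the high byte 0x04, so the
byte order matches code-point order and the space sorts first.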

Devin Asay
Humanities Technology and Research Support Center
Brigham Young University



