Unicode sorting

Dar Scott dsc at swcp.com
Fri May 26 02:07:46 EDT 2006


On May 25, 2006, at 4:19 PM, Devin Asay wrote:

> I have a need to sort long lists of Cyrillic unicode text according  
> to Russian alphabet order. Before I start writing my own routine,  
> has anyone figured out how to sort unicode text lists?

Here are some hints:

1.
Trick:  If all the characters in your strings come from the same
256-character range (that is, they share the same high byte), then
byte order doesn't matter when doing a lexical sort.  For example, if
all your characters are in the Cyrillic range of U+0400 to U+04FF,
then you can use an ordinary byte-wise character sort.  However, if
you have spaces (U+0020), you will need to replace them with something
else for sorting, or make sure you control the byte order.

2.
If the high byte of the Unicode characters never looks like a digit
then you can compare with < (probably not important if using 'sort').

3.
The basic alphabet of a language is typically coded in roughly the  
order needed for sorting.  That rough order may be just fine for your  
need.

4.
Conversion from lower to upper or upper to lower for sorting is often
just a bit-logic operation.  However, since you usually have to do
range checking anyway, adding or subtracting an offset works fine,
too.  If you know you have only basic upper and lower letters, doing
the bit op every time is probably faster.  This should work for a
rough sort.

5.
The basic alphabet of a language in unicode might include characters  
you don't use.  That is OK as long as the ones you do use are coded  
in the right order.  The holes don't matter.
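
A rough sketch of hints 1, 3, 4 and 5, written in Python rather than
LiveCode just so it is easy to run.  It assumes the text is UTF-16 and
that the list is Russian; every name in it (RUSSIAN, byte_sort,
fold_basic_cyrillic, alphabet_key) is made up for the illustration,
so treat it as a starting point and not a finished routine.

```python
# Sketch only: Python stand-ins for the hints above, not LiveCode code.
# All names below are hypothetical and exist only for this illustration.

RUSSIAN = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"  # 33-letter alphabet order


def byte_sort(words, encoding):
    """Sort by raw UTF-16 bytes, the way a byte-oriented sort would (hint 1)."""
    return sorted(words, key=lambda w: w.encode(encoding))


def fold_basic_cyrillic(text):
    """Lowercase the basic letters U+0410..U+042F by adding 0x20 (hint 4)."""
    return "".join(
        chr(ord(ch) + 0x20) if 0x0410 <= ord(ch) <= 0x042F else ch
        for ch in text
    )


def alphabet_key(word):
    """Exact Russian order: rank each letter by its position in RUSSIAN."""
    folded = fold_basic_cyrillic(word).replace("\u0401", "\u0451")  # Ё -> ё
    # Anything outside the alphabet sorts after all letters, by code point.
    return [RUSSIAN.index(ch) if ch in RUSSIAN else len(RUSSIAN) + ord(ch)
            for ch in folded]


words = ["яблоко", "арбуз", "вишня", "банан"]

# Hint 1: everything is in one 256-character block, so big-endian bytes,
# little-endian bytes and plain code point order all give the same result.
same_block_ok = (byte_sort(words, "utf-16-be")
                 == byte_sort(words, "utf-16-le")
                 == sorted(words))

# Hint 1 caveat: a space (U+0020) is outside the block, and now the two
# byte orders disagree about which string comes first.
with_space = ["а я", "аА"]
space_breaks_it = (byte_sort(with_space, "utf-16-be")
                   != byte_sort(with_space, "utf-16-le"))

# Hints 3 and 5: code point order is only rough.  The holes are harmless,
# but ё (U+0451) is coded after я (U+044F), so it lands at the end.
rough = sorted(["ёж", "яма", "еда"])
exact = sorted(["ёж", "яма", "еда"], key=alphabet_key)
```

One detail on hint 4 for Cyrillic specifically: the basic upper-case
letters run from U+0410 to U+042F, and U+0420 already has the 0x20 bit
set, so a single OR with 0x20 would leave the second half of the
alphabet unchanged.  That is why the sketch adds an offset after a
range check instead of using a bit op.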

Dar


