Unicode sorting
Dar Scott
dsc at swcp.com
Fri May 26 02:07:46 EDT 2006
On May 25, 2006, at 4:19 PM, Devin Asay wrote:
> I have a need to sort long lists of Cyrillic unicode text according
> to Russian alphabet order. Before I start writing my own routine,
> has anyone figured out how to sort unicode text lists?
Here are some hints:
1.
Trick: If every character in the strings you are sorting comes from
the same 256-character range, then byte order doesn't matter when
doing a lexical sort. For example, if all your characters are in the
Cyrillic range of U+0400 to U+04FF, then you can use an ordinary
byte-wise character sort. However, if you have spaces (U+0020) then
you will need to replace them with something else for sorting or make
sure you have control over order.
2.
If the high byte of the Unicode characters never looks like a digit
then you can compare with < (probably not important if using 'sort').
3.
The basic alphabet of a language is typically coded in roughly the
order needed for sorting. That rough order may be just fine for your
need.
4.
Conversion from lower to upper or upper to lower for sorting is often
just a bit-logic operation. However, since you usually have to do
range checking anyway, adding or subtracting an offset works fine,
too. If you know you have only basic upper and lower letters, doing
the bit op every time is probably faster. This should work for a
rough sort.
5.
The basic alphabet of a language in Unicode might include characters
you don't use. That is OK as long as the ones you do use are coded
in the right order. The holes don't matter.
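
To make hints 1, 3 and 4 concrete, here is a rough sketch in Python
(not LiveCode, and not tested against any LiveCode sort). It assumes
the text is UTF-16 and stays within U+0400 to U+04FF plus spaces. The
names utf16_bytes, space_safe_bytes, fold_basic_cyrillic, rough_key
and PLACEHOLDER are made up for illustration, and using U+0400 as the
stand-in for a space is only an assumption that the character never
appears in your data.

```python
# Sketch only: every helper name here is hypothetical.

def utf16_bytes(s, little_endian=True):
    """Raw UTF-16 bytes of s, without a byte-order mark."""
    return s.encode("utf-16-le" if little_endian else "utf-16-be")

# Hint 1: when all characters share one high byte (0x04 here), a plain
# byte sort gives the same order for either byte order.
words = ["яблоко", "Борщ", "арбуз", "Волга"]
by_codepoint = sorted(words)  # Python compares str by code point
by_bytes_le = sorted(words, key=utf16_bytes)
by_bytes_be = sorted(words, key=lambda s: utf16_bytes(s, little_endian=False))

# A space (U+0020) has a different high byte, so in little-endian bytes
# it lands in the middle of the alphabet instead of before it.
spaced = ["ИВАН", "И ТАК"]
spaced_le = sorted(spaced, key=utf16_bytes)

# Assumption: U+0400 never occurs in the data, so it can stand in for
# a space while sorting. It is the lowest code point in the block.
PLACEHOLDER = "\u0400"

def space_safe_bytes(s):
    """Little-endian UTF-16 bytes with spaces swapped for PLACEHOLDER."""
    return utf16_bytes(s.replace(" ", PLACEHOLDER))

spaced_fixed = sorted(spaced, key=space_safe_bytes)

# Hints 3 and 4: fold case with a range check plus an offset, then rely
# on code point order as a rough alphabetical order.
def fold_basic_cyrillic(ch):
    cp = ord(ch)
    if 0x0410 <= cp <= 0x042F:  # А..Я -> а..я
        return chr(cp + 0x20)
    if 0x0400 <= cp <= 0x040F:  # Ѐ..Џ (includes Ё) -> ѐ..џ
        return chr(cp + 0x50)
    return ch

def rough_key(s):
    """Sort key: case-folded code points, one per character."""
    return [ord(fold_basic_cyrillic(ch)) for ch in s]

rough = sorted(words, key=rough_key)
```

For the basic Russian letters the offset is the safer choice, because
the uppercase range U+0410 to U+042F straddles the 0x20 bit, so simply
setting that bit would not fold all of them. The order is only rough:
ё (U+0451) is coded after я (U+044F), so words starting with ё sort
last rather than next to е.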
Dar