Unicode sorting
Devin Asay
devin_asay at byu.edu
Fri May 26 17:57:23 EDT 2006
Thanks, Dar. These tips will come in handy, and they help confirm some
of the things I was already thinking. A 'sort lines' command, after
converting upper case to lower, works fairly well, except that,
curiously, a space sorts *after* all Cyrillic characters. I'm sure
that's because Rev is really doing an ASCII sort on the Unicode text,
and the first byte of each Cyrillic Unicode character is less than
0x20, the value of a space. What we really need is a sort ... unicode
option to go along with sort ... text and sort ... numeric.
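
To illustrate what such a Unicode-aware sort would do, here is a
minimal sketch in Python rather than Rev. The name sort_lines_unicode
is hypothetical, not an existing Rev command. It compares lowercased
lines code point by code point instead of byte by byte, so a space
(U+0020) orders before any Cyrillic letter.

```python
# Hypothetical helper, not a Rev/LiveCode feature: sort lines by Unicode
# code point after lowercasing, instead of by raw bytes.
def sort_lines_unicode(text):
    lines = text.split("\n")
    # Python compares str values code point by code point, so U+0020
    # (space) orders before every Cyrillic letter (U+0400 to U+04FF).
    return "\n".join(sorted(lines, key=str.lower))

sample = "дом\nДа нет\nдаже"
print(sort_lines_unicode(sample))
# Да нет
# даже
# дом
```

Sorting on a lowercased key leaves the original capitalization of each
line untouched, which is usually what you want for display.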
Devin
On May 26, 2006, at 12:07 AM, Dar Scott wrote:
>
> On May 25, 2006, at 4:19 PM, Devin Asay wrote:
>
>> I need to sort long lists of Cyrillic Unicode text
>> according to Russian alphabet order. Before I start writing my own
>> routine, has anyone figured out how to sort Unicode text lists?
>
> Here are some hints:
>
> 1.
> Trick: If you are sorting strings with only characters from the
> same 256 character range, then byte-order doesn't matter when doing
> a lexical sort. For example, if all your characters are in the
> Cyrillic range of U+0400 to U+04FF, then you can use an ordinary
> byte character sort. However, if you have spaces (U+0020) then you
> will need to replace them with something else for sorting or make
> sure you have control over order.
>
> 2.
> If the high byte of the Unicode characters never looks like a digit
> then you can compare with < (probably not important if using 'sort').
>
> 3.
> The basic alphabet of a language is typically coded in roughly the
> order needed for sorting. That rough order may be just fine for
> your need.
>
> 4.
> Conversion from lower to upper or upper to lower for sorting is
> often just a bit-logic operation. However, since you usually have
> to do range checking anyway, adding or subtracting an offset works
> fine, too. If you know you have only basic upper and lower
> letters, doing the bit op every time is probably faster. This
> should work for a rough sort.
>
> 5.
> The basic alphabet of a language in Unicode might include
> characters you don't use. That is OK as long as the ones you do
> use are coded in the right order. The holes don't matter.
>
> Dar
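
Hints 3 to 5 can be sketched as follows, in Python rather than Rev and
under some loudly stated assumptions: the text contains only basic
Russian letters, spaces, and possibly Ё/ё, and all function names here
are hypothetical. For the basic Russian letters, А to Я (U+0410 to
U+042F) map to а to я by adding 0x20. A single-bit OR with 0x20 would
not cover the whole range, because Р to Я already have that bit set,
so this sketch uses the range check plus offset. Ё and ё sit outside
the main run, so the key places ё right after е by hand.

```python
# Assumption: input uses only basic Russian letters, spaces, and
# possibly Ё/ё; everything else passes through unchanged.
UPPER_FIRST, UPPER_LAST = 0x0410, 0x042F  # А .. Я
CASE_OFFSET = 0x20                        # а is U+0430 = U+0410 + 0x20
YO_UPPER, YO_LOWER = 0x0401, 0x0451       # Ё, ё sit outside the main run
YE_LOWER = 0x0435                         # е

def to_lower_ru(ch):
    """Lowercase one character by range check plus offset (hint 4)."""
    cp = ord(ch)
    if UPPER_FIRST <= cp <= UPPER_LAST:
        return chr(cp + CASE_OFFSET)
    if cp == YO_UPPER:
        return chr(YO_LOWER)
    return ch

def rough_key(line):
    """Sort key: lowercased code points, with ё placed right after е."""
    key = []
    for ch in line:
        cp = ord(to_lower_ru(ch))
        # (code point, tiebreak): ё becomes (е, 1) so it follows е
        # but still precedes ж.
        key.append((YE_LOWER, 1) if cp == YO_LOWER else (cp, 0))
    return key

def rough_sort_ru(lines):
    return sorted(lines, key=rough_key)

print(rough_sort_ru(["ёж", "Жук", "ель", "яма", "Ель белая"]))
# ['ель', 'Ель белая', 'ёж', 'Жук', 'яма']
```

Because the key is built from code points rather than bytes, a space
(U+0020) compares lower than any Cyrillic letter, so no substitution
for spaces is needed in this version.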
Devin Asay
Humanities Technology and Research Support Center
Brigham Young University