Unicode sorting

Devin Asay devin_asay at byu.edu
Fri Jun 2 11:45:39 EDT 2006


Okay, Dar, I tried your idea. It works like a dream, at least for my  
problem (Cyrillic range unicode). I didn't even have to convert upper  
to lower case! In fact, I'm not even sure exactly why this works.

On Jun 1, 2006, at 5:21 PM, Dar Scott wrote:

> Wow!  Great news for sorting Unicode!
>
> On May 30, 2006, at 5:08 PM, Devin Asay wrote:
>
>> I got your code to work by making some simple changes in the  
>> sortCodeFromRussian function:
>
> Devin, I've been processing some bits of UTF-8, and something  
> dawned on me that is probably known by the Unicode experts.
>
>  **** A lexical byte sort of well-formed UTF-8 will result in a  
> Unicode code point sort!  *****

Here's what I did:

The call:

   set the unicodeText of fld 1 to uniEncode(sortRuss(the unicodeText of fld 1),"utf8")

The function:

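function sortRuss utf16RussList
   -- convert the UTF-16 field text to UTF-8 so a plain text sort can handle it
   put uniDecode(utf16RussList, "UTF8") into utf8RussList
   -- sort the UTF-8 lines as ordinary text
   sort lines of utf8RussList text
   -- return the sorted UTF-8; the caller converts it back with uniEncode
   return utf8RussList
end sortRuss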
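
As for why this works: a minimal sketch of Dar's point, in Python rather than Transcript. The sample words and the two helper names are made up for illustration. It sorts the same lines once by their raw UTF-8 bytes and once by their Unicode code points, and checks that both orders agree.

```python
# Illustrative check (hypothetical sample data): sorting strings by their
# UTF-8 bytes gives the same order as sorting by Unicode code points.

def sort_by_utf8_bytes(lines):
    """Sort text lines by the raw bytes of their UTF-8 encoding."""
    return sorted(lines, key=lambda s: s.encode("utf-8"))

def sort_by_code_points(lines):
    """Sort text lines by their sequences of Unicode code points."""
    return sorted(lines, key=lambda s: [ord(ch) for ch in s])

# Mixed ASCII, Cyrillic, and one character outside the BMP.
words = ["яблоко", "Ёж", "арбуз", "Zebra", "ёлка", "Борис", "apple", "𝄞"]

by_bytes = sort_by_utf8_bytes(words)
by_points = sort_by_code_points(words)

# Both orderings should be identical.
print(by_bytes == by_points)
```

Note that a code point sort is not a dictionary sort: uppercase letters come before lowercase ones, and ё (U+0451) lands after я (U+044F). Whether that matters depends on the data being sorted.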

>
> That avoids the NUL problem in sort.  That means that russianLex()  
> can return the UTF-8 of the string with your character conversions.
>
> I think the replace command will work with UTF-8, so you can even  
> avoid a character loop.  All you need is 34 replaces and then a  
> return.  OK, that might actually be slower than a character loop.

FWIW, at first I did do an uppercase-to-lowercase conversion. The  
replaces were very fast: less than 1 second on a list of more than  
1350 Unicode lines, using just a list of 34 replaces.

function russToLC lList
   replace "А" with "а" in lList
   replace "Б" with "б" in lList
   replace "В" with "в" in lList
   replace "Г" with "г" in lList
   replace "Д" with "д" in lList
   replace "Е" with "е" in lList
   replace "Ё" with "е" in lList -- convert "yo" to "ye"
   replace "ё" with "е" in lList
   replace "Ж" with "ж" in lList
   replace "З" with "з" in lList
   replace "И" with "и" in lList
   replace "Й" with "й" in lList
   replace "К" with "к" in lList
   replace "Л" with "л" in lList
   replace "М" with "м" in lList
   replace "Н" with "н" in lList
   replace "О" with "о" in lList
   replace "П" with "п" in lList
   replace "Р" with "р" in lList
   replace "С" with "с" in lList
   replace numToChar(4) & quote with "т" in lList -- uppercase Russian Т (U+0422) has 0x22 (the ASCII quote char) as byte 2, so it can't be typed inside a string literal
   replace "У" with "у" in lList
   replace "Ф" with "ф" in lList
   replace "Х" with "х" in lList
   replace "Ц" with "ц" in lList
   replace "Ч" with "ч" in lList
   replace "Ш" with "ш" in lList
   replace "Щ" with "щ" in lList
   replace "Ъ" with "ъ" in lList
   replace "Ы" with "ы" in lList
   replace "Ь" with "ь" in lList
   replace "Э" with "э" in lList
   replace "Ю" with "ю" in lList
   replace "Я" with "я" in lList
   return lList
end russToLC
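
The quote-character wrinkle with Т is specific to UTF-16. A small Python sketch of this, assuming big-endian byte order (which matches the quote char being byte 2): it scans the uppercase Cyrillic letters А..Я and lists the ones whose encoding contains the ASCII quote byte, once for UTF-16 and once for UTF-8.

```python
# Illustrative check: which encodings of uppercase Cyrillic letters contain
# the ASCII double-quote byte (0x22)?
# Assumption: UTF-16 is taken as big-endian, so the low byte comes second.
QUOTE = 0x22

# А..Я are U+0410..U+042F (Ё is U+0401 and falls outside this range).
upper = [chr(cp) for cp in range(0x0410, 0x0430)]

utf16_hits = [ch for ch in upper if QUOTE in ch.encode("utf-16-be")]
utf8_hits = [ch for ch in upper if QUOTE in ch.encode("utf-8")]

print(utf16_hits)  # letters that collide with the quote char in UTF-16
print(utf8_hits)   # letters that collide with the quote char in UTF-8
```

Only Т collides in UTF-16, and nothing collides in UTF-8, since every byte of a multi-byte UTF-8 sequence is 0x80 or higher. So the same list of replaces applied to UTF-8 text should not need the special case.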



>
> Dar
> Unicode Sophomore

Devin
Still in Unicode Prep School

Devin Asay
Humanities Technology and Research Support Center
Brigham Young University



