Unicode sorting
Devin Asay
devin_asay at byu.edu
Fri Jun 2 11:45:39 EDT 2006
Okay, Dar, I tried your idea. It works like a dream, at least for my
problem (Cyrillic range unicode). I didn't even have to convert upper
to lower case! In fact, I'm not even sure exactly why this works.
On Jun 1, 2006, at 5:21 PM, Dar Scott wrote:
> Wow! Great news for sorting Unicode!
>
> On May 30, 2006, at 5:08 PM, Devin Asay wrote:
>
>> I got your code to work by making some simple changes in the
>> sortCodeFromRussian function:
>
> Devin, I've been processing some bits of UTF-8, and something
> dawned on me that is probably known by the Unicode experts.
>
> **** A lexical byte sort of well-formed UTF-8 will result in a
> Unicode code point sort! *****
Here's what I did:
The call:
set the unicodeText of fld 1 to \
uniEncode(sortRuss(the unicodeText of fld 1), "utf8")
The function:
function sortRuss utf16RussList
put uniDecode(utf16RussList, "UTF8") into utf8RussList
sort lines of utf8RussList text
return utf8RussList
end sortRuss
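
For anyone else wondering why this works, here is a minimal sketch that checks Dar's claim on a handful of words. It is written in Python rather than LiveCode purely so it can be run anywhere; the word list and the function names are made up for illustration and are not part of the code above.

```python
# Sketch: sorting well-formed UTF-8 by raw bytes gives the same order
# as sorting by Unicode code point. The sample words are arbitrary.
words = ["яблоко", "Борщ", "ёж", "арбуз", "Ёлка", "zebra", "дом"]


def sort_by_utf8_bytes(items):
    # Compare the encoded bytes, the way a plain byte-wise sort would.
    return sorted(items, key=lambda s: s.encode("utf-8"))


def sort_by_code_points(items):
    # Compare the code point numbers of each string directly.
    return sorted(items, key=lambda s: [ord(ch) for ch in s])


def has_nul_byte(data):
    # True if the byte string contains a zero byte.
    return 0 in data


by_bytes = sort_by_utf8_bytes(words)
by_code_points = sort_by_code_points(words)

print(by_bytes)
print(by_bytes == by_code_points)

# ASCII characters in UTF-16 carry a zero byte; UTF-8 text never does
# unless it contains an actual NUL character.
print(has_nul_byte("Dar".encode("utf-16-be")))
print(has_nul_byte("Dar Дар".encode("utf-8")))
```

Note that in this order Ё (U+0401) lands before А and ё (U+0451) lands after я, since both sit outside the main Cyrillic block.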
>
> That avoids the NUL problem in sort. That means that russianLex()
> can return the UTF-8 of the string with your character conversions.
>
> I think the replace command will work with UTF-8, so you can even
> avoid a character loop. All you need is 34 replaces and then a
> return. OK, that might actually be slower than a character loop.
FWIW, at first I did do an uppercase-to-lowercase conversion. The
replaces were very fast: less than 1 second on a list of more than
1350 Unicode lines. It's just a list of 34 replaces.
function russToLC lList
replace "А" with "а" in lList
replace "Б" with "б" in lList
replace "В" with "в" in lList
replace "Г" with "г" in lList
replace "Д" with "д" in lList
replace "Е" with "е" in lList
replace "Ё" with "е" in lList -- convert "yo" to "ye"
replace "ё" with "е" in lList
replace "Ж" with "ж" in lList
replace "З" with "з" in lList
replace "И" with "и" in lList
replace "Й" with "й" in lList
replace "К" with "к" in lList
replace "Л" with "л" in lList
replace "М" with "м" in lList
replace "Н" with "н" in lList
replace "О" with "о" in lList
replace "П" with "п" in lList
replace "Р" with "р" in lList
replace "С" with "с" in lList
replace numToChar(4) & quote with "т" in lList -- U.C. Russ T (U+0422)
-- has 0x22 as byte 2 (= ASCII quote char)
replace "У" with "у" in lList
replace "Ф" with "ф" in lList
replace "Х" with "х" in lList
replace "Ц" with "ц" in lList
replace "Ч" with "ч" in lList
replace "Ш" with "ш" in lList
replace "Щ" with "щ" in lList
replace "Ъ" with "ъ" in lList
replace "Ы" with "ы" in lList
replace "Ь" with "ь" in lList
replace "Э" with "э" in lList
replace "Ю" with "ю" in lList
replace "Я" with "я" in lList
return lList
end russToLC
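
The same idea can be sketched as a sort key instead of a list of replaces. This is again Python, not LiveCode, and the names russian_sort_key and sort_russian are hypothetical. Unlike russToLC, it leaves the text itself unchanged and only folds case and ё for the comparison.

```python
def russian_sort_key(line):
    # Lower-case the line, then treat "ё" as "е", mirroring the
    # replace list above. The key is compared as UTF-8 bytes.
    return line.lower().replace("ё", "е").encode("utf-8")


def sort_russian(lines):
    # Sort by the folded key; the original spelling is kept in the output.
    return sorted(lines, key=russian_sort_key)


sample = ["ёж", "Яблоко", "ель", "арбуз", "Ёлка"]
print(sort_russian(sample))
```

With this key, "ёж" and "Ёлка" sort among the "е" words instead of at the far ends of the list.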
>
> Dar
> Unicode Sophomore
Devin
Still in Unicode Prep School
Devin Asay
Humanities Technology and Research Support Center
Brigham Young University