Unicode sorting

Devin Asay devin_asay at byu.edu
Tue May 30 19:08:59 EDT 2006


Dar,

I got your code to work by making some simple changes in the
sortCodeFromRussianChar function:

function sortCodeFromRussianChar utf16Char
   set the useUnicode to true
   put charToNum(utf16Char) into unicodePoint

## Devin's changes - it turns out leaving the code points in decimal
## works perfectly, and I only had to make a couple of adjustments.
   if unicodePoint > 1039 and unicodePoint < 1072 then -- ignore case
     add 32 to unicodePoint
   else if unicodePoint = 1105 or unicodePoint = 1025 then -- sort 'yo' (either case) with 'ye'
     put 1077 into unicodePoint
   end if
##
   --   switch unicodePoint
   --   case 0x0020 -- space
   --     get 1
   --     break
   --   ...
   --   default
   --     get 255
   --   end switch
   return unicodePoint --it
end sortCodeFromRussianChar
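
For readers who want to check the arithmetic outside of Rev, here is a rough Python restatement of the same decimal code point rules. It is only a sketch: the names sort_code, russian_lex and sort_russian are hypothetical, the key is built as a tuple of numbers rather than a string of 1-byte chars, and the uppercase Yo (1025) case is included as an assumption about the intended behavior.

```python
# Illustrative sketch only: function names are hypothetical, and only the
# numeric rules (fold 1040..1071 by +32, map yo to ye) come from the post.

def sort_code(ch):
    """Return a numeric sort code for a single character."""
    cp = ord(ch)
    if 1039 < cp < 1072:        # uppercase Cyrillic 1040..1071: ignore case
        cp += 32                # shift to the lowercase range 1072..1103
    elif cp in (1105, 1025):    # lowercase yo (1105); uppercase yo (1025) is an assumption
        cp = 1077               # sort 'yo' with 'ye'
    return cp


def russian_lex(line):
    """Build a per-line sort key from the per-character sort codes."""
    return tuple(sort_code(ch) for ch in line)


def sort_russian(lines):
    """Sort a list of lines using the per-line key."""
    return sorted(lines, key=russian_lex)


if __name__ == "__main__":
    print(sort_russian(["ёлка", "Яблоко", "ель", "арбуз", "Борщ"]))
```

With these rules, words starting with yo land among the ye words instead of after the rest of the alphabet, and capitalized words sort alongside lowercase ones.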


On May 27, 2006, at 2:05 PM, Dar Scott wrote:

> Try something roughly like this (not tested; typed in raw):
>
> function sortRussian utf16RussianList
>    -- use utf8 to get rid of NULs and extra line ends
>    put uniDecode(utf16RussianList, "UTF8") into utf8RussianList
>    sort lines of utf8RussianList text by russianLex(each)
>    return utf8RussianList
> end sortRussian
>
> -- returns string suitable for lexical comparison (Rev sort text)
> -- of a utf8 string made up of Russian subset of Cyrillic plus some
> -- ASCII
> function russianLex utf8RussianLine
>    -- Add adjustments for special words here
>    put uniEncode(utf8RussianLine, "UTF8") into utf16RussianLine
>    put empty into lex
>    -- uniCode char loop
>    repeat with i = 1 to length(utf16RussianLine)-1 step 2
>       put char i to i+1 of utf16RussianLine into utf16RussianChar
>       -- Add char dropping tests here
>       put sortCodeFromRussianChar( utf16RussianChar) into sortNumber
>       -- use 1-byte chars for sorting
>       put numToChar( sortNumber ) after lex
>    end repeat
>    return lex
> end russianLex
>
> -- returns number in range 1 to 255 indicating sort position of
> -- allowed characters
> function sortCodeFromRussianChar utf16Char
>    set the useUnicode to true
>    put charToNum(utf16Char) into unicodePoint
>    switch unicodePoint
>    case 0x0020 -- space
>      get 1
>      break
>    ...
>    default
>      get 255
>    end switch
>    return it
> end sortCodeFromRussianChar
>
> This will take some debugging.

Only a little. ;-)

This is a huge help! Thanks a million.

Devin

Devin Asay
Humanities Technology and Research Support Center
Brigham Young University



