Unicode sorting

Dar Scott dsc at swcp.com
Sat May 27 16:05:03 EDT 2006

On May 27, 2006, at 9:12 AM, Devin Asay wrote:

> For the Russian (don't know if this will come thru in your email  
> reader):
>   Я вижу вас.
> The unicode is (omitting the "U+" convention):
>   042F 0020 0432 0438 0436 0443 0020 0432 0430 0441 002E
> But what rev is seeing during sort is a series of single byte  
> chars, with leading null bytes of basic latin range chars ignored:
>   04 2F 20 04 32 04 38 04 36 04 43 20 04 32 04 30 04 41 2E
> Since all of the first bytes of the Cyrillic range are 04 (< 20),  
> they are always sorted *before* virtually everything in lower ascii  
> range.

The letters came through.

But the NUL bytes are not dropped unless something in your handling
is dropping them.  If they are being dropped, then the sort would
indeed see the byte sequence you describe.
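
To make the two byte views concrete, here is a small Python illustration (Python rather than Rev, purely so it can be run anywhere). It assumes big-endian UTF-16 without a byte-order mark, which matches the byte order in Devin's listing; the helper name hex_view is made up for this sketch.

```python
# Illustration only: reproduce the two byte views of the sample sentence.
text = "Я вижу вас."

# Big-endian UTF-16 without a BOM: two bytes per character for this text.
utf16_be = text.encode("utf-16-be")


def hex_view(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


# The full UTF-16 bytes, NULs included.
with_nuls = hex_view(utf16_be)

# The same bytes after something has stripped every NUL byte.
without_nuls = hex_view(bytes(b for b in utf16_be if b != 0))

print(with_nuls)
print(without_nuls)
```

The second line printed matches the sequence Devin reports, so that listing is what the data looks like once the NULs are gone.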

I see you have a period in your characters.  Perhaps you have other  
characters outside Russian Cyrillic.

I forgot about the line ends in sort.  In general, a byte that looks
like a line end (0A or 0D) can turn up inside a UTF-16 character,
which would split a line in the wrong place.
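
A quick way to see which characters are affected, sketched in Python for illustration. It assumes big-endian UTF-16; has_line_end_byte is a hypothetical helper name. As far as this check goes, the basic Russian letters (U+0410 to U+044F) are safe, but other characters, including some Cyrillic ones outside that range, are not.

```python
def has_line_end_byte(ch: str) -> bool:
    """True if the UTF-16 (big-endian) bytes of ch contain 0x0A or 0x0D."""
    data = ch.encode("utf-16-be")
    return 0x0A in data or 0x0D in data


# U+040A (Cyrillic NJE, not a Russian letter) encodes as 04 0A.
nje = "\u040A"

# The contiguous basic Russian block, U+0410 through U+044F.
russian_basic = [chr(cp) for cp in range(0x0410, 0x0450)]
risky_in_basic = [ch for ch in russian_basic if has_line_end_byte(ch)]

print(has_line_end_byte(nje))
print(risky_in_basic)
```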

I wonder if sorting the UTF-8 of the lowercased Cyrillic would be
close.  The idea below is probably better in general.
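
For what it is worth, here is that idea tried out in Python (illustration only, with a made-up word list). Sorting on the UTF-8 bytes of the lowercased words gets the basic letters in order, but the letter yo lands after everything else instead of right after ye.

```python
# Made-up sample words; "ёж" starts with yo, "ель" with ye.
words = ["яблоко", "Арбуз", "ёж", "банан", "ель"]

# Sort key: the raw UTF-8 bytes of the lowercased word.
by_utf8_lower = sorted(words, key=lambda w: w.lower().encode("utf-8"))

print(by_utf8_lower)
```

In dictionary order "ёж" would come right after "ель", so this is close but not quite right.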

Try something roughly like this (not tested; typed in raw):

function sortRussian utf16RussianList
    -- use utf8 to get rid of NULs and extra line ends
    put uniDecode(utf16RussianList, "UTF8") into utf8RussianList
    sort lines of utf8RussianList text by russianLex(each)
    return utf8RussianList
end sortRussian

-- returns string suitable for lexical comparison (Rev sort text)
-- of a utf8 string made up of Russian subset of Cyrillic plus some  
function russianLex utf8RussianLine
    -- Add adjustments for special words here
    put uniEncode(utf8RussianLine, "UTF8") into utf16RussianLine
    put empty into lex
    repeat with i = 1 to length(utf16RussianLine)-1 step 2 -- Unicode char loop
       put char i to i+1 of utf16RussianLine into utf16RussianChar
       -- Add char dropping tests here
       put sortCodeFromRussianChar( utf16RussianChar) into sortNumber
       put numToChar( sortNumber ) after lex -- use 1-byte chars for sorting
    end repeat
    return lex
end russianLex

-- returns number in range 1 to 255 indicating sort position of
-- allowed characters
function sortCodeFromRussianChar utf16Char
    set the useUnicode to true
    put charToNum(utf16Char) into unicodePoint
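    switch unicodePoint
    case 0x0020 -- space
      get 1
      break
    -- Add a case for each remaining allowed character here
    default
      get 255 -- anything not recognized sorts last
    end switch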
    return it
end sortCodeFromRussianChar

This will take some debugging.
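
For comparison, here is the same idea as a minimal Python sketch. This is not the Rev code, only an analogue under stated assumptions: the alphabet table, the choice to fold case before lookup, and the names russian_lex and sort_russian are all assumptions made for the illustration. Space gets code 1 and anything unrecognized gets 255, as in the handler above.

```python
# Assumed table: space first, then the 33 Russian letters in alphabet order.
ALPHABET = " абвгдеёжзийклмнопрстуфхцчшщъыьэюя"

# Map each allowed character to a sort code starting at 1.
SORT_CODE = {ch: i + 1 for i, ch in enumerate(ALPHABET)}
UNKNOWN = 255  # anything not in the table sorts last


def russian_lex(line: str) -> bytes:
    """Return a one-byte-per-character key suitable for plain byte comparison."""
    return bytes(SORT_CODE.get(ch, UNKNOWN) for ch in line.lower())


def sort_russian(lines):
    """Sort lines using the per-line key (the key itself is never returned)."""
    return sorted(lines, key=russian_lex)


print(sort_russian(["яблоко", "Арбуз", "ёж", "банан", "ель"]))
```

Because yo has its own slot in the table, it sorts right after ye here.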

In the approach above, one-byte "chars" are used for sorting.  An
alternative is to use two ASCII chars per character: a space plus the
char itself for the ASCII subset, and two letters that sort right for
each Cyrillic letter.  That would make testing easier for
russianLex(), since the keys stay readable.
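
One possible reading of that alternative, sketched in Python for illustration. The pair scheme ("aa", "ab", ... for the Cyrillic letters) and the name readable_lex are assumptions of this sketch, and it assumes the input holds only ASCII and Russian letters.

```python
# Assumed table: the 33 Russian letters in alphabet order.
ALPHABET = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"


def readable_lex(line: str) -> str:
    """Return a printable key using two ASCII chars per input character."""
    parts = []
    for ch in line.lower():
        i = ALPHABET.find(ch)
        if i >= 0:
            # Cyrillic: two lowercase letters, "aa" through "bg", in alphabet order.
            parts.append(chr(ord("a") + i // 26) + chr(ord("a") + i % 26))
        else:
            # Everything else: a space followed by the character itself.
            parts.append(" " + ch)
    return "".join(parts)


print(repr(readable_lex("да.")))
print(sorted(["яблоко", "Арбуз", "ёж", "банан", "ель"], key=readable_lex))
```

Since space sorts below every lowercase letter, the ASCII pairs always come before the Cyrillic pairs, and the keys can be read by eye when something sorts wrong.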

I remember from yesterday that yo (e with two dots over it) is not in
the basic Russian group: the basic letters run from U+0410 to U+044F,
while yo is at U+0401 (capital) and U+0451 (small).  So you will need
to handle it separately from the basic Russian range.
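
A short Python check of those code points, for illustration only (BASIC_RANGE is just a name chosen for this sketch):

```python
# The contiguous basic Russian block, U+0410 through U+044F.
BASIC_RANGE = range(0x0410, 0x0450)

# Capital and small yo, written as escapes to avoid any ambiguity.
yo_chars = "\u0401\u0451"

# For each yo, record whether it falls outside the basic block.
yo_outside = [ord(ch) not in BASIC_RANGE for ch in yo_chars]

for ch in yo_chars:
    print(f"U+{ord(ch):04X}", "in basic range:", ord(ch) in BASIC_RANGE)
```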

BTW, for those not familiar with using customFun(each) in sort,  
customFun() seems to be called only once for each line; it is not  
called twice for each comparison.
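
Python's sorted() with a key function behaves the same way, which may help picture it. This is only an analogy, not a statement about how Rev implements sort; the counter and the name counting_key are made up for the sketch.

```python
calls = 0  # how many times the key function has run


def counting_key(line: str) -> str:
    """Key function that counts its own invocations."""
    global calls
    calls += 1
    return line.lower()


lines = ["delta", "Alpha", "charlie", "Bravo"]
result = sorted(lines, key=counting_key)

# The key is computed once per line, then the saved keys are compared.
print(calls, result)
```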

I am not in favor of a Unicode sort option.  I'll elaborate later.  I  
have a couple goals to meet by tonight.

