Unicode sorting
Dar Scott
dsc at swcp.com
Sat May 27 16:05:03 EDT 2006
On May 27, 2006, at 9:12 AM, Devin Asay wrote:
>
> For the Russian (don't know if this will come thru in your email
> reader):
> Я вижу вас.
> The unicode is (omitting the "U+" convention):
> 042F 0020 0432 0438 0436 0443 0020 0432 0430 0441 002E
>
> But what rev is seeing during sort is a series of single byte
> chars, with leading null bytes of basic latin range chars ignored:
> 04 2F 20 04 32 04 38 04 36 04 43 20 04 32 04 30 04 41 2E
>
> Since all of the first bytes of the Cyrillic range are 04 (< 20),
> they are always sorted *before* virtually everything in lower ascii
> range.
The letters came through.
The NUL bytes are not dropped unless something on your end is dropping
them. If they are dropped, then the sort will come out as you describe.
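To make the effect of dropped NULs concrete, here is a small sketch in Python (not Rev; it only illustrates the byte arithmetic). It assumes big-endian UTF-16, which is the byte order shown in the quoted message. The helper name and the Latin sample line are made up for the illustration.

```python
def utf16be_bytes(text, drop_nuls=False):
    """Encode text as big-endian UTF-16, optionally stripping every 0x00 byte."""
    raw = text.encode("utf-16-be")
    return raw.replace(b"\x00", b"") if drop_nuls else raw

russian = "Я вижу вас."
latin = "I see you."   # hypothetical Latin line to compare against

kept_ru, kept_la = utf16be_bytes(russian), utf16be_bytes(latin)
drop_ru, drop_la = utf16be_bytes(russian, True), utf16be_bytes(latin, True)

# NULs kept: the Latin line starts with 00 49, which sorts before 04 2F.
latin_first_when_kept = kept_la < kept_ru
# NULs dropped: the Cyrillic line starts with 04, which sorts before 49.
cyrillic_first_when_dropped = drop_ru < drop_la

# Byte string the sort would compare once the NULs are gone.
print(drop_ru.hex(" ").upper())
```

So whether Cyrillic lands before or after the Latin lines in a byte-wise sort depends on whether the NULs survive.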
I see you have a period in your characters. Perhaps you have other
characters outside Russian Cyrillic.
I forgot about the line ends in sort. In UTF-16 in general, a line-end
byte (0A or 0D) can turn up as one half of a two-byte character, so a
byte-wise split into lines can cut a character in two.
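A quick way to see the line-end problem, sketched in Python rather than Rev: U+040A (a Cyrillic letter used in Serbian, outside the Russian alphabet) has 0A as its low byte in UTF-16, so a byte-oriented line splitter cuts it in half. The splitter function is a stand-in written for this illustration, not anything Rev exposes.

```python
def naive_byte_lines(utf16_bytes):
    """Split raw bytes on 0x0A, the way a byte-oriented line splitter would."""
    return utf16_bytes.split(b"\n")

one_line = "\u040A".encode("utf-16-be")   # U+040A -> bytes 04 0A
pieces = naive_byte_lines(one_line)        # one character, but two "lines"

as_utf8 = "\u040A".encode("utf-8")        # D0 8A: no 0x0A byte to trip over
```

In UTF-8 every byte of a non-ASCII character is 0x80 or above, which is why converting first avoids both the stray line ends and the NULs.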
I wonder if a sort of the utf8 of the Cyrillic to-lower would be
close. The idea below is probably better in general.
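Here is roughly what the UTF-8 lowercase idea would give, sketched in Python (not Rev) with a made-up word list. UTF-8 byte order follows code point order, so it works for the contiguous letters but puts yo (ё, U+0451) after ya (я, U+044F).

```python
def utf8_lower_key(line):
    """Sort key: the UTF-8 bytes of the lowercased line."""
    return line.lower().encode("utf-8")

words = ["яблоко", "ёж", "Вода", "арбуз"]   # hypothetical sample lines
by_utf8 = sorted(words, key=utf8_lower_key)

# "ёж" ends up last, although alphabetically it belongs before "яблоко".
print(by_utf8)
```

That is the sense in which it is only "close".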
Try something roughly like this (not tested; typed in raw):
function sortRussian utf16RussianList
  -- use utf8 to get rid of NULs and extra line ends
  put uniDecode(utf16RussianList, "UTF8") into utf8RussianList
  -- the sort keys are raw byte codes, so compare them case-sensitively;
  -- otherwise codes in the A-Z and a-z ranges would compare as equal
  set the caseSensitive to true
  sort lines of utf8RussianList text by russianLex(each)
  return utf8RussianList
end sortRussian

-- returns string suitable for lexical comparison (Rev sort text)
-- of a utf8 string made up of Russian subset of Cyrillic plus some ASCII
function russianLex utf8RussianLine
  -- Add adjustments for special words here
  put uniEncode(utf8RussianLine, "UTF8") into utf16RussianLine
  put empty into lex
  -- uniCode char loop: two bytes per UTF-16 character
  repeat with i = 1 to length(utf16RussianLine)-1 step 2
    put char i to i+1 of utf16RussianLine into utf16RussianChar
    -- Add char dropping tests here
    put sortCodeFromRussianChar(utf16RussianChar) into sortNumber
    -- use 1-byte chars for sorting
    put numToChar(sortNumber) after lex
  end repeat
  return lex
end russianLex

-- returns number in range 1 to 255 indicating sort position of
-- allowed characters
function sortCodeFromRussianChar utf16Char
  set the useUnicode to true
  put charToNum(utf16Char) into unicodePoint
  switch unicodePoint
    case 0x0020 -- space
      get 1
      break
    ...
    default
      get 255
  end switch
  return it
end sortCodeFromRussianChar
This will take some debugging.
In the approach above, one-byte "chars" are used for sorting. An
alternative is to use two ASCII chars per character: a space followed
by the char itself for the ASCII subset, and two letters that sort
correctly for each Cyrillic letter. That would make russianLex() easier
to test, because the keys can be read directly.
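A sketch of the two-char key idea, in Python rather than Rev. The pair assignment ("aa", "ab", ...), the "~~" fallback and the function names are all assumptions made up for the illustration. Any pairs that sort in alphabet order would do.

```python
# Russian alphabet in dictionary order, yo included (33 letters).
ALPHABET = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"

def pair_for(index):
    """Two lowercase Latin letters that sort in index order: aa, ab, ..., bg."""
    return chr(ord("a") + index // 26) + chr(ord("a") + index % 26)

CYRILLIC_PAIRS = {ch: pair_for(i) for i, ch in enumerate(ALPHABET)}

def readable_key(line):
    """Build a printable sort key, two ASCII chars per input character."""
    parts = []
    for ch in line.lower():
        if ch in CYRILLIC_PAIRS:
            parts.append(CYRILLIC_PAIRS[ch])
        elif ord(ch) < 0x80:
            parts.append(" " + ch)      # ASCII subset: space, then the char
        else:
            parts.append("~~")          # anything else sorts last (assumed)
    return "".join(parts)

words = ["яблоко", "ёж", "Вода", "арбуз"]   # hypothetical sample lines
ordered = sorted(words, key=readable_key)
```

Because a space sorts before any letter, the ASCII subset always lands before the Cyrillic pairs, and a key such as `readable_key("я в.")` can be checked by eye.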
I remember from yesterday that yo (ё, an e with two dots over it) is
not in the basic Russian group, so you will need to handle it
separately from the basic Russian range.
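To show what handling yo separately might look like, here is a Python sketch (not Rev). The contiguous lowercase letters run from U+0430 to U+044F, while yo sits at U+0451 (capital U+0401), so it needs its own case. The code values (space = 1, letters starting at 10, everything else 255) are assumptions for the illustration and are not meant as the missing cases of the switch above.

```python
BASE = 10  # hypothetical sort code for the first letter

def sort_code(ch):
    """Return a sort position in 1..255 for one character (assumed layout)."""
    ch = ch.lower()
    if ch == " ":
        return 1
    if ch == "\u0451":                  # yo: outside the contiguous range
        return BASE + 6                 # slot right after е
    cp = ord(ch)
    if 0x0430 <= cp <= 0x044F:          # а..я, the contiguous range
        pos = cp - 0x0430               # 0 for а, 5 for е, 6 for ж
        return BASE + pos + (1 if pos >= 6 else 0)   # leave a gap for yo
    return 255                          # everything else sorts last

def russian_lex(line):
    """One byte per character, comparable byte-wise."""
    return bytes(sort_code(ch) for ch in line)

words = ["яблоко", "ёж", "Вода", "арбуз", "ель", "жук"]  # made-up samples
ordered = sorted(words, key=russian_lex)
```

Whether yo should get its own slot or sort together with е is a collation choice; this sketch gives it its own slot.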
BTW, for those not familiar with using customFun(each) in sort,
customFun() seems to be called only once for each line; it is not
called twice for each comparison.
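The same key-per-line pattern exists in other languages, which may help picture it. In Python, for example, the key function passed to a sort is documented to run once per item, and that is easy to confirm with a counter. This only shows Python's behavior; it is not a measurement of what Rev does.

```python
calls = []   # records every invocation of the key function

def counting_key(line):
    """Key function that logs each call before returning the key."""
    calls.append(line)
    return line.lower()

lines = ["delta", "Alpha", "charlie", "Bravo", "echo"]
ordered = sorted(lines, key=counting_key)

# One call per line, not one per comparison.
print(len(calls), len(lines))
```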
I am not in favor of a Unicode sort option. I'll elaborate later. I
have a couple of goals to meet by tonight.
Dar