the selectedText of a Unicode field

Slava Paperno slava.paperno at cornell.edu
Fri Jun 17 03:31:31 EDT 2011


Briefly, here is the problem:

put the selectedText of field "BilingualText" of this card into tCurrSelection
-- the field may hold, for example, the two words Боб Bob, which is the
-- string assigned to the unicodeText prop. of the field

set the unicodeText of field "YourSelection" of this card to tCurrSelection

--alternatively:
set the unicodeText of field "YourSelection" of this card to uniEncode(tCurrSelection, "UTF8")

Neither alternative works for bilingual text. The first version works only
for Russian, the second works only for English. In each case, the other
language is unreadable in field "YourSelection".

In detail:

I'm working with bilingual texts (Russian and English) in various LC
contexts. The Russian portion is in Unicode. I found things that simply
cannot be used with Unicode, and also things that do work, or can be adapted
to work. But now I'm stuck with the selectedText and the selectedChunk
properties of a text entry field. Here's the script:

put the selectedText of field "BilingualText" of this card into tCurrSelection
-- the field may hold, for example, Боб Bob; the value assigned to the
-- unicodeText prop. of the field
set the unicodeText of field "YourSelection" of this card to tCurrSelection

The above snippet works fine when the selected text in field "BilingualText"
is all Russian. When it is English, some Chinese characters are displayed in
field "YourSelection". The reason, as far as I understand, is that the
Russian text, when stored in variable tCurrSelection, is already uniEncoded,
but the English text is not, so before it can be displayed in a Unicode
field, it has to be uniEncoded, like this:

put the selectedText of field "BilingualText" of this card into tCurrSelection
set the unicodeText of field "YourSelection" of this card to uniEncode(tCurrSelection, "UTF8")

Indeed, the above snippet works when the selected text is English, but it
displays non-readable text when it is Russian (because--I think--the text is
twice uniEncoded; I've learned to recognize degrees of "unreadability").
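
The two failure modes described above can be reproduced outside LiveCode. The
following is a minimal Python sketch, not LiveCode code, and it rests on two
assumptions that the post does not confirm: that the "Unicode" text is UTF-16
in little-endian byte order, and that the English text is one byte per
character. The helper name as_utf16 is hypothetical.

```python
# Illustration only: Python stands in for what the LiveCode field appears to do.
# Assumptions: "Unicode" text is UTF-16 little-endian; plain text is 1 byte/char.

def as_utf16(raw: bytes) -> str:
    """Interpret raw bytes as UTF-16LE, dropping a trailing odd byte."""
    usable = len(raw) - (len(raw) % 2)
    return raw[:usable].decode("utf-16-le")

english = "Bob".encode("latin-1")      # 3 bytes: 42 6F 62
russian = "Боб".encode("utf-16-le")    # 6 bytes: 11 04 3E 04 31 04

# Case 1: single-byte English handed to a UTF-16 consumer.
# Bytes 42 6F pair up into U+6F42, which lies in the CJK ideograph range.
misread = as_utf16(english)

# Case 2: UTF-16 Russian treated as if it were UTF-8 and encoded again.
# Every byte is below 0x80, so each byte becomes a character of its own.
double_encoded = russian.decode("utf-8").encode("utf-16-le")

print(hex(ord(misread)))                   # 0x6f42
print(len(russian), len(double_encoded))   # 6 12
```

Under these assumptions, the first case matches the "Chinese characters" seen
for English selections, and the second matches the doubled length and
unreadable output seen for Russian selections.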

Using the selectedChunk property has the same problem.

When I try to examine the decimal code point of each character of the
selection with charToNum(), to determine whether it is Roman or not, I run
into the same problem. If I know that the character I am testing is
double-byte Cyrillic, I have to use charToNum(char N to N+1) for each
character. But if I use that formula on an ANSI string, I get meaningless
results (especially useless for a string with an odd number of characters,
like "Bob", because char 3 to 4 of "Bob" returns empty). I thought that Roman
letters would be stored in the selectedText property as a null byte followed
by the ANSI code, but apparently that is not the case: "Bob" is stored as
three bytes, even when it is part of the selectedText of a Unicode field,
whereas a three-letter Russian word is stored in the selectedText as 6 bytes.
If this sounds wrong, then maybe I am wrong. I'd like to know.

I also tried examining the individual bytes in the string with byteToNum(),
but that doesn't help either because, for example, decimal 66 can be an ANSI
character or the first byte of a double-byte Cyrillic letter.
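
The ambiguity of byte 66 can be shown with a short Python sketch (again, not
LiveCode). It assumes UTF-16 little-endian for the Cyrillic text and one byte
per character for the ANSI text; the Russian word "тот" is chosen here only
because its letter т is U+0442. The helper name byte_values is hypothetical.

```python
# Illustration only (Python, not LiveCode); UTF-16LE storage is an assumption.

def byte_values(raw: bytes) -> list:
    """Return the decimal value of every byte, like byteToNum() per byte."""
    return list(raw)

ansi_bob = "Bob".encode("latin-1")       # one byte per letter
utf16_tot = "тот".encode("utf-16-le")    # two bytes per letter, т = U+0442

print(byte_values(ansi_bob))    # [66, 111, 98]
print(byte_values(utf16_tot))   # [66, 4, 62, 4, 66, 4]
```

Both byte lists start with 66, so a single byte value cannot tell "B" apart
from the first half of "т" without knowing how the run was encoded.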

I do know about the requirement to use "set useUnicode to true" for the
charToNum() to work.

Finally, I tried examining the result of applying uniDecode() to the
selectedText, and also got nowhere.

Is there a solution to this conundrum? Am I missing something obvious? 

Thanks!

Slava





