Unicode mysteries

Neville Smythe neville.smythe at optusnet.com.au
Thu Mar 26 02:53:37 EDT 2020


I am trying to understand the mysteries of Unicode encodings; the following may (or may not) be useful (or confusing) to others.

The docs say the full chunk expression for a Unicode character is
      byte i of codeunit j of codepoint k of character c of str
(with the warning that this is "not of general utility" … indeed!)
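
For what it's worth, here is a minimal sketch of drilling down through that hierarchy (the variable names are mine, and tStr is assumed to already hold the flag character examined below):

      put character 1 of tStr into tChar   -- the whole flag, reported as one char
      put codepoint 1 of tChar into tCp    -- U+1F3F4, the base black flag
      put codeunit 1 of tCp into tCu       -- one UTF-16 surrogate of that codepoint
      put byte 1 of codeunit 1 of codepoint 1 of character 1 of tStr into tByte
      put byteToNum(tByte)                 -- value depends on the engine's internal byte order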

Taking a look at the Emoji ‘flag of Scotland’ character 🏴󠁧󠁢󠁳󠁣󠁴󠁿 which won’t display here but exists in the Apple Color Emoji font and in corresponding fonts for other platforms, I get

put 🏴󠁧󠁢󠁳󠁣󠁴󠁿 into str
number of chars of str:	1

char 1 of str :	🏴󠁧󠁢󠁳󠁣󠁴󠁿	number of codepoints of char 1 of str:	7
     codepoint:1	1F3F4	with 2 codeunits (D83C DFF4)
     codepoint:2	0	        with 0 codeunits - seems to be a placeholder rather than an actual codepoint
     codepoint:3	E0067	(DB40 DC67)
     codepoint:4	0	
     codepoint:5	E0062	(DB40 DC62)
     codepoint:6	0	
     codepoint:7	E0073	(DB40 DC73)

number of codepoints of str: 7
number of codeunits of str: 14
number of codeunits of char 1 of str: 14
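
For anyone who wants to reproduce that report, a sketch along these lines should do it (variable names are mine; the hex comes from encoding each codepoint to big-endian UTF-16, so the placeholder codepoints may not report anything meaningful):

      -- list each codepoint of character 1 with its hex value,
      -- its codeunit count, and its UTF-16 byte pairs
      put character 1 of tStr into tChar
      repeat with k = 1 to the number of codepoints of tChar
         put codepoint k of tChar into tCp
         put "codepoint" && k & ":" && format("%X", codepointToNum(tCp)) \
               & " (" & the number of codeunits of tCp & " codeunits)" into tLine
         put textEncode(tCp, "UTF-16BE") into tUnits
         repeat with b = 1 to the number of bytes of tUnits step 2
            put " " & format("%02X%02X", byteToNum(byte b of tUnits), \
                  byteToNum(byte b + 1 of tUnits)) after tLine
         end repeat
         put tLine & return after msg
      end repeat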

So there are 6 codeunits which are not in any codepoint (or at least not as reported by LC). They can be enumerated by looping over "codeunit j of str" rather than "codeunit j of codepoint k of ...", or by textEncode(str, "UTF-16") and then enumerating the bytes of the binary-encoded str.
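
For example, a sketch of the textEncode route (encoding big-endian to avoid the byte swapping; variable names are mine):

      put empty into tDump
      put textEncode(tStr, "UTF-16BE") into tData   -- 28 bytes = 14 codeunits
      repeat with b = 1 to the number of bytes of tData step 2
         put format("%02X%02X ", byteToNum(byte b of tData), \
               byteToNum(byte b + 1 of tData)) after tDump
      end repeat
      put tDump   -- D83C DFF4 DB40 DC67 ...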

The bytes of the binary encoding comprise all the codeunits (the encoding is actually in little-endian byte order, but is given here in big-endian order, which is the order reported by enumerating the codeunits):
       D83C DFF4 DB40 DC67 DB40 DC62 DB40 DC73 DB40 DC63 DB40 DC74 DB40 DC7F

These should correspond to the codepoints
       1F3F4 E0067 E0062 E0073 E0063 E0074 E007F
And indeed, if I manually build a UTF-16 string from these codepoints it does display as the flag of Scotland. So the lesson is that the reported chunks are not to be trusted naively, though that's not exactly a bug given the documentation warning.
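
One way to rebuild it (using numToCodepoint rather than assembling surrogate pairs by hand; the field name is made up):

      put empty into tFlag
      repeat for each item tCode in "1F3F4,E0067,E0062,E0073,E0063,E0074,E007F"
         put numToCodepoint(baseConvert(tCode, 16, 10)) after tFlag
      end repeat
      set the text of field "Test" to tFlag   -- displays the flag of Scotland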

1F3F4, by the way, is a black flag; the remaining codepoints are tag characters from the Tags Unicode block (spelling out "gbsct", then a cancel tag). Amusingly, the Rainbow flag emoji is made up of 3 characters: char 1 is a white flag, char 2 is an invisible join instruction (a zero-width joiner), char 3 is a rainbow.

BTW, backspacing over the displayed Rainbow flag actually has to be done in three steps to remove the glyph, which I think is not correct behaviour for an editor, since it appears to the user as one Unicode character. Apple's TextEdit, for example, deletes the Rainbow flag with a single backspace. There are nasties lurking here for text-manipulation LC code. Perhaps there should be a new string element 'unicodeChar'? BTW, I have nothing but huge admiration for the LC Unicode implementation team; it is a subject of extreme complexity.
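
For the curious, a sketch of assembling the Rainbow flag the same way, assuming the sequence is U+1F3F3 (white flag), U+FE0F (variation selector-16), U+200D (zero-width joiner), U+1F308 (rainbow):

      put empty into tRainbow
      repeat for each item tCode in "1F3F3,FE0F,200D,1F308"
         put numToCodepoint(baseConvert(tCode, 16, 10)) after tRainbow
      end repeat
      put the number of characters of tRainbow   -- 3, as described above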

Another question (which I think has been raised before, but I don't think there was an answer?). When a character (codepoint) in a string is displayed and the requested font does not have that codepoint, the OS substitutes a glyph from another font (or the missing-character glyph if no font supports the codepoint). So, for example, if you change the font of the above flag of Scotland to Arial, it still displays as the flag of Scotland, even though this glyph is not in Arial. LC will still report that the font of this character is Arial: from what I can gather this is not the fault of LC; the OS is doing the substitution behind its back (TextEdit does the same). But is there any way to find out (programmatically) the actual font being used?
   
   




