Jumping cursors

Richmond Mathewson richmondmathewson at gmail.com
Thu Jan 5 06:23:25 EST 2017


Thank you: 373 wonky results!

Well, to be honest, I'm not going to wait for you and yours to sort that
out; I shall use the list to help me avoid wonky Unicode addresses.

On 1/5/17 1:07 pm, Mark Waddingham wrote:
> On 2017-01-05 11:01, Richmond Mathewson wrote:
>> Ha, Ha, Ha: possibly the first time ever that it hasn't been the 
>> latter :)
>>>
>>> http://quality.livecode.com/show_bug.cgi?id=19045
>>
>> By 'stupid engine' do you mean the LiveCode engine, something else, or
>> code that has been co-opted
>> from elsewhere and folded into the LC engine?
>
> Specifically the internal routine which fetches the Unicode 
> 'properties' for a run of characters is currently computing a 
> surrogate pair's codepoint incorrectly - in this case U+0FF001 is 
> being treated as U+07BC - which is an undefined codepoint and as such 
> the property info being fetched (in this case, BiDi class) is undefined.
>
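
For anyone following the arithmetic, here is a minimal sketch of the two
calculations in LiveCode script. The function names are made up for
illustration, and the "wrong" formula is simply the one used in the test
script further down this thread, on the assumption that it mirrors what
the engine routine is doing.

```livecode
-- Hypothetical helpers, not engine code.
function pairToCodepoint pLeading, pTrailing
   -- Standard UTF-16 decoding: the leading surrogate supplies the high
   -- ten bits, the trailing surrogate the low ten, plus the 0x10000 offset.
   return 0x10000 + ((pLeading - 0xD800) * 2^10) + (pTrailing - 0xDC00)
end function

function pairToWrongCodepoint pLeading, pTrailing
   -- Assumed form of the bug: the shift is applied to the trailing
   -- surrogate instead, and the 0x10000 offset is missing.
   return (pLeading - 0xD800) + ((pTrailing - 0xDC00) * 2^10)
end function
```

U+FF001 is stored as the pair 0xDBBC, 0xDC01. Fed that pair, the first
function gives 0xFF001 (1044481) and the second gives 0x07BC (1980),
which matches the mix-up described above.
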
>> I, like a fool, had assumed that post LiveCode 7.0 the engine was,
>> somehow, avoiding surrogate pairs
>> altogether, rather than fudging around so things were *very pleasant
>> indeed* for people like me when
>> leveraging glyphs occupying Unicode areas above the first plane.
>>
>> Obviously things were slightly too good to be true.
>
> The engine does 'automatically' deal with surrogate pairs in UTF-16. 
> Indeed, the fact that they exist at all in the engine's internal 
> representation is generally not something the developer has to worry 
> about (modulo bugs, like the one above).
>
> You can use the codeunit chunk to access a string's individual UTF-16 
> components, codepoint chunk to access a string as a sequence of actual 
> codepoints, and char to access a string as a sequence of graphemes 
> (approximation to what most people call 'letters' or 'characters').
>
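
A small sketch of those three chunk types, assuming LiveCode 7 or later;
the handler and variable names are only for illustration. The string
holds one SPUA-A codepoint followed by "e" plus a combining acute accent,
so the three counts need not agree.

```livecode
on mouseUp
   local tString, tReport
   -- one supplementary-plane codepoint, then "e" + U+0301 (combining acute)
   put numToCodepoint(0xF0000) & "e" & numToCodepoint(0x301) into tString

   -- count the same string as UTF-16 units, codepoints and graphemes
   put "codeunits:" && (the number of codeunits in tString) into tReport
   put return & "codepoints:" && (the number of codepoints in tString) after tReport
   put return & "chars:" && (the number of chars in tString) after tReport
   put tReport
end mouseUp
```
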
>> Do you have any idea which other surrogate pairs it might be getting 
>> wrong?
>>
>> Until (if ?) things get sorted out that would be a useful reference
>> list so as to know which Unicode slots
>> to avoid.
>
> This should list all the codepoints in the SPUA-A which will cause 
> directionality problems (due to incorrect property lookup):
>
>    local tList
>    repeat with tCodepoint = 0xF0000 to 0xFFFFD
>       get numToCodepoint(tCodepoint)
>
>       local tLeading, tTrailing
>       put codepointToNum(codeunit 1 of it) into tLeading
>       put codepointToNum(codeunit 2 of it) into tTrailing
>
>       local tWrongCodepoint
>       put (tLeading - 0xD800) + ((tTrailing - 0xDC00) * 2^10) into tWrongCodepoint
>
>       get codepointProperty(numToCodepoint(tWrongCodepoint), "Bidi Class")
>       if it contains "Right_To_Left" or it contains "Arabic" then
>          put format("U+0x%6x has wrong bidi class - %s\n", tCodepoint, it) after tList
>       end if
>    end repeat
>    put tList

Anyone who wants to mess around with this on Windows or Linux (I am on a
Macintosh at the moment) can download this:

https://www.dropbox.com/s/i8ba0viztujs0dq/bad%20Unicode.livecode.zip?dl=0
>
>> Writing as a lazy slob I feel no screaming urge to go back and recode
>> all those (0x4FFF6), (0x3EEDA)
>> hex codes as surrogate pairs . . .
>
> Doing so wouldn't do you any good anyway. The bug lies in the 
> processing of the string *after* it has been constructed - whether it 
> is constructed directly from codepoints, or codeunits wouldn't make a 
> difference.
>
> I've submitted a PR for a fix to the problem against the 8.1 branch here:
>
>    https://github.com/livecode/livecode/pull/5020

Presumably that also holds for the LiveCode 9 series.
>
> Warmest Regards,
>
> Mark.
>

Best, Richmond.


