Jumping cursors

Mark Waddingham mark at livecode.com
Thu Jan 5 06:07:58 EST 2017


On 2017-01-05 11:01, Richmond Mathewson wrote:
> Ha, Ha, Ha: possibly the first time ever that it hasn't been the latter 
> :)
>> 
>> http://quality.livecode.com/show_bug.cgi?id=19045
> 
> By 'stupid engine' do you mean the LiveCode engine, something else, or
> code that has been co-opted
> from elsewhere and folded into the LC engine?

Specifically, the internal routine which fetches the Unicode 'properties' 
for a run of characters is currently computing a surrogate pair's 
codepoint incorrectly - in this case U+FF001 is being treated as U+07BC, 
which is an unassigned codepoint, and as such the property info being 
fetched (in this case, BiDi class) is undefined.
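
To make the arithmetic concrete, here is a minimal sketch in Python (used 
only so the numbers can be checked outside the engine; it is not the 
engine's code). The function names are hypothetical, and decode_swapped 
assumes the faulty lookup behaves like the tWrongCodepoint formula in the 
script further down this message:

```python
def encode_surrogates(cp):
    """Split a supplementary-plane codepoint into (leading, trailing) UTF-16 codeunits."""
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def decode_correct(leading, trailing):
    """Standard UTF-16 decoding: the leading unit carries the high 10 bits."""
    return 0x10000 + ((leading - 0xD800) << 10) + (trailing - 0xDC00)


def decode_swapped(leading, trailing):
    """Assumed model of the bug: the roles of the two units are swapped
    and the 0x10000 offset is dropped."""
    return (leading - 0xD800) + ((trailing - 0xDC00) << 10)


leading, trailing = encode_surrogates(0xFF001)
print(hex(leading), hex(trailing))             # 0xdbbc 0xdc01
print(hex(decode_correct(leading, trailing)))  # 0xff001
print(hex(decode_swapped(leading, trailing)))  # 0x7bc
```

With the swapped formula the pair 0xDBBC 0xDC01 comes out as 0x7BC, which 
matches the U+07BC mentioned above.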

> I, like a fool, had assumed that post LiveCode 7.0 the engine was,
> somehow, avoiding surrogate pairs
> altogether, rather than fudging around so things were *very pleasant
> indeed* for people like me when
> leveraging glyphs occupying Unicode areas above the first plane.
> 
> Obviously things were slightly too good to be true.

The engine does 'automatically' deal with surrogate pairs in UTF-16. 
Indeed, the fact that they exist at all in the engine's internal 
representation is generally not something the developer has to worry 
about (modulo bugs, like the one above).

You can use the codeunit chunk to access a string's individual UTF-16 
components, the codepoint chunk to access a string as a sequence of actual 
codepoints, and the char chunk to access a string as a sequence of graphemes 
(an approximation of what most people call 'letters' or 'characters').
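
As an illustration of how the three views differ, here is a rough sketch 
in Python rather than LiveCode. The helper names are hypothetical, and the 
grapheme count is only an approximation (it treats each non-combining 
codepoint as the start of a new grapheme), not the engine's actual 
segmentation:

```python
import unicodedata


def count_codeunits(s):
    """Number of UTF-16 codeunits (two bytes each) needed to store s."""
    return len(s.encode("utf-16-le")) // 2


def count_codepoints(s):
    """Number of Unicode codepoints; Python strings are codepoint sequences."""
    return len(s)


def count_graphemes_approx(s):
    """Rough grapheme count: combining marks attach to the preceding codepoint."""
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


# 'e' + COMBINING ACUTE ACCENT, followed by one SPUA-A codepoint
sample = "e\u0301" + chr(0xFF001)
print(count_codeunits(sample))         # 4
print(count_codepoints(sample))        # 3
print(count_graphemes_approx(sample))  # 2
```

The SPUA-A codepoint takes two codeunits but is one codepoint, and the 
accented 'e' takes two codepoints but reads as one grapheme.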

> Do you have any idea which other surrogate pairs it might be getting 
> wrong?
> 
> Until (if ?) things get sorted out that would be a useful reference
> list so as to know which Unicode slots
> to avoid.

This should list all the codepoints in Supplementary Private Use Area-A 
(SPUA-A) which will cause directionality problems (due to incorrect 
property lookup):

    local tList
    repeat with tCodepoint = 0xF0000 to 0xFFFFD
       get numToCodepoint(tCodepoint)

       local tLeading, tTrailing
       put codepointToNum(codeunit 1 of it) into tLeading
       put codepointToNum(codeunit 2 of it) into tTrailing

       local tWrongCodepoint
       put (tLeading - 0xD800) + ((tTrailing - 0xDC00) * 2^10) into tWrongCodepoint

       get codepointProperty(numToCodepoint(tWrongCodepoint), "Bidi Class")
       if it contains "Right_To_Left" or it contains "Arabic" then
          put format("U+0x%6x has wrong bidi class - %s\n", tCodepoint, it) after tList
       end if
    end repeat
    put tList

> Writing as a lazy slob I feel no screaming urge to go back and recode
> all those (0x4FFF6), (0x3EEDA)
> hex codes as surrogate pairs . . .

Doing so wouldn't do you any good anyway. The bug lies in the processing 
of the string *after* it has been constructed - whether it is 
constructed directly from codepoints or from codeunits makes no 
difference.

I've submitted a PR for a fix to the problem against the 8.1 branch 
here:

    https://github.com/livecode/livecode/pull/5020

Warmest Regards,

Mark.

-- 
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps



