Mapping Field Text Ranges (was Re: Interprocess Communication (IPC) under OSX)
Mark Waddingham
mark at livecode.com
Thu Dec 28 12:45:35 EST 2017
On 2017-12-19 19:43, Mark Waddingham via use-livecode wrote:
> I'm pretty sure it would be possible to write a handler which takes
> the styledText array of a field in 6.7.11 and a list of old indices,
> returning a list of new char indices... Would that help?
Paul expressed an interest in how this might work - and he provided some
more background:
-*-
Our main application, HyperRESEARCH, a tool for academics and others
doing qualitative research, relies completely on chunk ranges. It is
essentially a bookmarking tool where users can select some content from
a document, the character position (chunk) is grabbed and the user gives
it a text label and HyperRESEARCH remembers that label "Early Childhood
Behavior X" points to char S to T of document "ABC". All documents,
native text, unicode (utf8 or utf16), rtf, docx, odt, etc. are read into
a LiveCode field, from which the selection is made and the chunk
obtained. HyperRESEARCH saves a "Study" file that contains a LOT of
these labels, chunks, and document names.
As part of our migration from LC464, which is what the current release
of HyperRESEARCH is based on, we needed a way to convert a character
range created under LC4.6.4 to a range under LC6.7.11 that points to the
exact same string for the same file. Curry Kenworthy, whose libraries we
license for reading MS-Word and Open Office documents, built a library
based on an algorithm I came up with to send the original LC464 ranges
to a helper application using sockets or IPC. The helper application
retrieves the strings associated with each chunk, strips white space and
passes the string back to the LC6.7.11 version of the main app, which
then finds the whitespace stripped strings in the same file loaded under
LC6.7.11 with an indexing mechanism to adjust the positions for the
stripped whitespace. It is a bit complicated, but it works reliably.
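As an aside, the whitespace-stripping trick Paul describes can be sketched in a few lines of Python. This is purely an illustration of the idea, not Curry's library - the function and variable names are made up here:

```python
def find_stripped(haystack, needle):
    """Locate needle in haystack ignoring whitespace, returning the
    1-based (from, to) char positions of the match in the original
    text - 1-based to match LiveCode chunk expressions."""
    # Map each non-whitespace char of haystack back to its original index.
    index_map = [i for i, ch in enumerate(haystack) if not ch.isspace()]
    stripped_haystack = "".join(ch for ch in haystack if not ch.isspace())
    stripped_needle = "".join(ch for ch in needle if not ch.isspace())

    pos = stripped_haystack.find(stripped_needle)
    if pos < 0:
        return None

    # Convert the match back to positions in the unstripped text.
    return (index_map[pos] + 1,
            index_map[pos + len(stripped_needle) - 1] + 1)

print(find_stripped("ab c d", "bcd"))  # (2, 6)
```

Matching on the stripped strings and then mapping back through the index table is what makes the approach insensitive to whitespace differences between the two loads of the document.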
-*-
From this I infer the following:
1) The study file is a list of triples - label, char chunk, document
filename
2) When using the study file, the original document is loaded into a
field, and the char chunks are used to display labels which the user can
jump to.
3) The char chunks are old-style (pre-5.5) byte indices, not codeunit
indices
The crux of the problem Paul is having comes down to (3) which has some
background to explain.
Before 7.0, the field was the only part of the engine which naturally
handled Unicode. In these older versions the field stored text as a
mixed sequence of style runs of either single bytes (native text) or
double bytes (unicode text).
Between 5.5 and 7.0, indices used when referencing chars in fields
corresponded to codeunits - this meant that the indices were
independent of the encoding of the runs in the field. In this case char
N referred to the Nth codeunit in the field, whether the text up until
that point was all unicode, all native or a mixture of both.
Before 5.5, indices used when referencing chars in fields corresponded
to bytes - this meant that you had to take into account the encoding of
the runs in the field. In this case, char N referred to the Nth byte in
the field. So if your field had:
abcXYZabc (where XYZ are two byte unicode chars)
Then char 4 would refer to the first byte of the X unicode char and
*not* the two bytes it would have actually taken up.
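To make the pre-5.5 behaviour concrete, here is a small Python sketch (my illustration, not engine code) which computes the byte offset at which each char starts under the old mixed-run storage, with Latin-1 standing in for the engine's native encoding:

```python
def native_byte_length(ch):
    # 1 byte if representable in the single-byte native encoding
    # (Latin-1 stands in for it here), otherwise 2 bytes - one
    # UTF-16 codeunit - for unicode text.
    try:
        ch.encode("latin-1")
        return 1
    except UnicodeEncodeError:
        return 2

def byte_offsets(text):
    # 1-based byte offset at which each char starts in the old
    # mixed-run storage.
    offsets = []
    offset = 1
    for ch in text:
        offsets.append(offset)
        offset += native_byte_length(ch)
    return offsets

# "abcXYZabc" with XYZ as genuinely two-byte unicode chars:
print(byte_offsets("abcЖЯЩabc"))  # [1, 2, 3, 4, 6, 8, 10, 11, 12]
```

So in a pre-5.5 field containing this text, char 4 addresses byte 4 - only the first byte of Ж - and char 5 lands mid-character at byte 5.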
Now, importantly, the internal structure of the field did not change
between 4.6.4 and 5.5, just how the 'char' chunk worked - in 6.7.11, the
internal structure of the field is still the mixed runs of
unicode/native bytes just as it was in 4.6.4 - the only difference is
what happens if you reference char X to Y of the field.
So solving this problem comes down to finding a means to 'get at' the
internal encoding style runs of a field in 6.7.11. We want a handler:
mapByteRangeToCharRange(pFieldId, pByteFrom, pByteTo)
Returning a pair pCharFrom, pCharTo - where pByteFrom, pByteTo are a
char X to Y range from 4.6.4 and pCharFrom, pCharTo are a char X to Y
range *for the same range* in 6.7.11.
-*-
Before going into the details, an easy way to see the internal mixed
encoding of a field containing unicode in 6.7.11, is to put some text
which is a mixture of native text and unicode text in a field and then
look at its 'text' property. Putting:
Лорем ипсум Lorem ipsum dolor sit amet, pr долор сит амет, вел татион
игнота сцрибентур еи. Вих еа феугиат doctus necessitatibus ассентиор
пхилосопхиа. Феугаитconsulatu disputando comprehensam вивендум вис ет,
мел еррем малорум ат. Хас но видерер лобортис, suscipit detraxit
interesset eum аппетере инсоленс салутатус усу не. Еи дуо лудус яуаеяуе,
ет елитр цорпора пер.
Into a 6.7.11 field and then doing 'put the text of field 1' gives:
????? ????? Lorem ipsum dolor sit amet, pr ????? ??? ????, ??? ??????
?????? ?????????? ??. ??? ?? ??????? doctus necessitatibus ?????????
???????????. ???????consulatu disputando comprehensam ???????? ??? ??,
??? ????? ??????? ??. ??? ?? ??????? ????????, suscipit detraxit
interesset eum ???????? ???????? ????????? ??? ??. ?? ??? ????? ???????,
?? ????? ??????? ???.
Here we see some of how 6.7.11 fields handled unicode. Each '?' indicates
that the 'char' being fetched at that index is a unicode codeunit (i.e.
not representable in the native encoding). It is relatively easy to see
by inspection that these match up quite easily - for each cyrillic
letter there is a '?', and the roman letters come through directly.
In contrast if I do the same thing with 4.6.4, I get this:
??>?@?5?<? 8???A?C?<? Lorem ipsum dolor sit amet, pr 4?>?;?>?@? A?8?B?
0?<?5?B?, 2?5?;? B?0?B?8?>?=? 8?3?=?>?B?0? A?F?@?8?1?5?=?B?C?@? 5?8?.
??8?E? 5?0? D?5?C?3?8?0?B? doctus necessitatibus 0?A?A?5?=?B?8?>?@?
??E?8?;?>?A?>???E?8?0?. $?5?C?3?0?8?B?consulatu disputando comprehensam
2?8?2?5?=?4?C?<? 2?8?A? 5?B?, <?5?;? 5?@?@?5?<? <?0?;?>?@?C?<? 0?B?.
%?0?A? =?>? 2?8?4?5?@?5?@? ;?>?1?>?@?B?8?A?, suscipit detraxit
interesset eum 0?????5?B?5?@?5? 8?=?A?>?;?5?=?A? A?0?;?C?B?0?B?C?A?
C?A?C? =?5?. ??8? 4?C?>? ;?C?4?C?A? O?C?0?5?O?C?5?, 5?B? 5?;?8?B?@?
F?>?@???>?@?0? ??5?@?.
In order to make sure this came through vaguely sanely, I've replaced
all bytes < 32 with ?. If you compare with 6.7.11 output you can see
that for each '?' present in 'the text' of the 6.7.11 field, there are
*two* chars in the 4.6.4 output:
Лорем (original) -> ????? (6.7.11) -> ??>?@?5?<? (4.6.4)
This shows quite clearly the difference between 4.6.4 and 6.7.11 in
handling text/char ranges - in 6.7.11 whilst internally each unicode
codeunit takes up two bytes you don't see that, instead you see only a
single 'char'. In comparison in 4.6.4, all the gory details are laid
bare - you see the individual bytes making up the unicode codeunits.
-*-
Now, the above is only a rough way to see the internals of the field -
the ? char in any one place in the text could be an actual '?' or a '?'
which comes about because there is a non-native codeunit there. However,
you can tell the encoding of any one char in a field by looking at the
'encoding' property of the char.
put the encoding of char 1 of field 1 -> unicode
put the encoding of char 30 of field 1 -> native
We can use this information (in 6.7.11) to implement the required
handler (which uses an auxiliary handler to map one index):
-- Map a 4.6.4 char (byte) range to a 5.5+ char range.
function mapByteRangeToCharRange pFieldId, pByteFrom, pByteTo
   -- Convert the index of the beginning of the range.
   local tCharFrom
   put mapByteIndexToCharIndex(pFieldId, pByteFrom) into tCharFrom

   -- Convert the index of the end of the range. We add 1 to the end
   -- offset here so that we find the index of the char after the end
   -- char. We need to do this as the byte range of a single unicode
   -- char is always 2 bytes long.
   local tCharTo
   put mapByteIndexToCharIndex(pFieldId, pByteTo + 1) into tCharTo

   -- If the range is a singleton, charFrom and charTo will be the
   -- same.
   if tCharFrom is tCharTo then
      return tCharFrom,tCharTo
   end if

   -- Otherwise it is a multi-char range, and tCharTo will actually
   -- be the char after the end of the range (due to the adjustment
   -- above).
   return tCharFrom, tCharTo - 1
end mapByteRangeToCharRange
-- Map a 4.6.4 char (byte) offset to a 5.5+ char offset.
private function mapByteIndexToCharIndex pFieldId, pByteIndex
   -- Char indices start from 1
   local tCharIndex
   put 1 into tCharIndex

   -- We iterate over the 5.5+ notion of chars until the original 4.6.4
   -- byte index is exhausted.
   repeat while pByteIndex > 1
      -- If the encoding of the char at the 5.5+ index is native, then
      -- it will have required 1 byte in 4.6.4; otherwise it will have
      -- required 2 bytes in 4.6.4.
      if the encoding of char tCharIndex of pFieldId is "native" then
         subtract 1 from pByteIndex
      else
         subtract 2 from pByteIndex
      end if

      -- We've consumed a single 5.5+ char, and either 1 or 2 4.6.4
      -- bytes at this point.
      add 1 to tCharIndex
   end repeat

   -- The final char index we computed is the char corresponding to
   -- the byte index in 4.6.4.
   return tCharIndex
end mapByteIndexToCharIndex
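For anyone who wants to experiment outside LiveCode, the same algorithm ports to Python almost line for line. Here the per-char encodings are passed in as a list - an assumption of this sketch; in the engine you would query the encoding property instead:

```python
def map_byte_index_to_char_index(encodings, byte_index):
    # encodings[i] is "native" or "unicode" for char i + 1 of the field.
    char_index = 1
    # Consume 4.6.4 bytes until the original byte index is exhausted.
    while byte_index > 1:
        if encodings[char_index - 1] == "native":
            byte_index -= 1  # native chars took 1 byte in 4.6.4
        else:
            byte_index -= 2  # unicode chars took 2 bytes in 4.6.4
        char_index += 1
    return char_index

def map_byte_range_to_char_range(encodings, byte_from, byte_to):
    char_from = map_byte_index_to_char_index(encodings, byte_from)
    # Map the index of the char *after* the end of the range, since a
    # single unicode char spans 2 bytes.
    char_to = map_byte_index_to_char_index(encodings, byte_to + 1)
    if char_from == char_to:
        return char_from, char_to
    return char_from, char_to - 1

# "abcXYZabc" where XYZ are unicode chars (2 bytes each in 4.6.4):
enc = ["native"] * 3 + ["unicode"] * 3 + ["native"] * 3
print(map_byte_range_to_char_range(enc, 4, 5))   # (4, 4): the X char
print(map_byte_range_to_char_range(enc, 1, 12))  # (1, 9): whole text
```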
Now, this isn't the most efficient method of doing it - for example, you
could scan from the start offset to the end offset rather than from the
beginning again; or use the styledText array of the field which gives
you the encoding of each style run in the field - this would save the
by-char lookup. Perhaps an interesting exercise to see how fast it can
be made?
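For instance, a run-based variant of the index mapping might look like this in Python, with the runs given as (charCount, encoding) pairs such as could be derived from the styledText array (the representation here is my own; check the actual array layout in the dictionary):

```python
def map_byte_index_via_runs(runs, byte_index):
    # runs: list of (char_count, encoding) pairs describing the field's
    # style runs in order; encoding is "native" or "unicode".
    char_index = 1
    remaining = byte_index - 1  # 4.6.4 bytes still to consume
    for char_count, encoding in runs:
        width = 1 if encoding == "native" else 2
        run_bytes = char_count * width
        if remaining < run_bytes:
            # The target lands inside this run; divide the leftover
            # bytes by the per-char width to finish in O(1).
            return char_index + remaining // width
        remaining -= run_bytes
        char_index += char_count
    return char_index

# The same "abcXYZabc" example as three runs:
runs = [(3, "native"), (3, "unicode"), (3, "native")]
print(map_byte_index_via_runs(runs, 4))   # 4 (first byte of X)
print(map_byte_index_via_runs(runs, 10))  # 7 (the char after Z)
```

This walks run headers rather than individual chars, so the cost is proportional to the number of runs rather than the length of the text.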
-*-
So this is the solution for 4.6.4->6.7.11. In 7+ the internal structure
of the field *did* change: it moved to using a string for each paragraph
rather than a mixed style-run approach - i.e. the internal data
structure for each paragraph is either a unicode string or a native
string (although you can't tell the difference in 7 as that's an
internal detail). In order for the approach to work in 7.x, the 4.6.4
internal structure would need to be recreated from the text of the
field. This is definitely possible to do - basically the approach 4.6.4
used was to convert all chars it could to native, leaving the rest as
unicode. So:
xxxXyZwww (uppercase are unicode only chars, lowercase are
can-be-native unicode chars)
Would end up with:
xxx - native
X - unicode
y - native
Z - unicode
www - native
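Sketching that splitting step in Python (again with Latin-1 standing in for the engine's native encoding - the real can-be-native test is an internal engine detail):

```python
from itertools import groupby

def run_encoding(ch):
    # Can this char be stored as a single native byte?
    try:
        ch.encode("latin-1")
        return "native"
    except UnicodeEncodeError:
        return "unicode"

def split_into_runs(text):
    # Group consecutive chars of the same encoding into runs, mirroring
    # how 4.6.4 stored field text.
    return [("".join(chars), encoding)
            for encoding, chars in groupby(text, key=run_encoding)]

# Using Ж and Щ as the unicode-only X and Z of the example:
print(split_into_runs("xxxЖyЩwww"))
# [('xxx', 'native'), ('Ж', 'unicode'), ('y', 'native'),
#  ('Щ', 'unicode'), ('www', 'native')]
```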
Once split up like this, rather than accessing the encoding property of
the field, you would use the encoding derived by splitting up the text
content of the field in the above manner.
-*-
Of course, having said that (and testing in 7.0) - the encoding property
of char ranges in the field should probably return 'unicode' for
unicode-only chars and 'native' for can-be-native chars. I'd need to look into
why it doesn't currently - but if it did, I *think* the above code would
work in 7+ as well as 5.5+. (I've filed
http://quality.livecode.com/show_bug.cgi?id=20811 so I don't forget to
have a look!).
Warmest Regards,
Mark.
--
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps