Mapping Field Text Ranges (was Re: Interprocess Communication (IPC) under OSX)

Thu Dec 28 20:01:49 EST 2017

Mark,

Thank you so much!!!!

On 12/28/2017 12:45 PM, Mark Waddingham via use-livecode wrote:
> On 2017-12-19 19:43, Mark Waddingham via use-livecode wrote:
>> I'm pretty sure it would be possible to write a handler which takes
>> the styledText array of a field in 6.7.11 and a list of old indices,
>> returning a list of new char indices... Would that help?
>
> Paul expressed an interest in how this might work - and he provided
> some more background:
>
> -*-
>
> Our main application, HyperRESEARCH, a tool for academics and others
> doing qualitative research, relies completely on chunk ranges. It is
> essentially a bookmarking tool where users can select some content from
> a document, the character position (chunk) is grabbed and the user gives
> it a text label and HyperRESEARCH remembers that label "Early Childhood
> Behavior X" points to char S to T of document "ABC". All documents,
> native text, unicode (utf8 or utf16), rtf, docx, odt, etc. are read into
> a LiveCode field, from which the selection is made and the chunk
> obtained. HyperRESEARCH saves a "Study" file that contains a LOT of
> these labels, chunks and document names.
>
> As part of our migration from LC464, which is what the current release
> of HyperRESEARCH is based on, we needed a way to convert a character
> range created under LC4.6.4 to a range under LC6.7.11 that points to the
> exact same string for the same file. Curry Kenworthy, whose libraries we
> license for reading MS-Word and Open Office documents, built a library
> based on an algorithm I came up with to send the original LC464 ranges
> to a helper application using sockets or IPC. The helper application
> retrieves the strings associated with each chunk, strips white space and
> passes the string back to the LC6.7.11 version of the main app, which
> then finds the whitespace stripped strings in the same file loaded under
> LC6.7.11 with an indexing mechanism to adjust the positions for the
> stripped whitespace. It is a bit complicated, but it works reliably.
>
> -*-
>
> From this I infer the following:
>
> 1) The study file is a list of triples - label, char chunk, document
> filename
>
> 2) When using the study file, the original document is loaded into a
> field, and the char chunks are used to display labels which the user
> can jump to.
>
> 3) The char chunks are old-style (pre-5.5) byte indices, not codeunit
> indices
>
> The crux of the problem Paul is having comes down to (3), which needs
> some background to explain.
>
> Before 7.0, the field was the only part of the engine which naturally
> handled Unicode. In these older versions the field stored text as a
> mixed sequence of style runs of either single bytes (native text) or
> double bytes (unicode text).
>
> Between 5.5 and 7.0, indices used when referencing chars in fields
> corresponded to codeunits - this meant that the indices were
> independent of the encoding of the runs in the field. In this case
> char N referred to the Nth codeunit in the field, whether the text up
> to that point was all unicode, all native or a mixture of both.
>
> Before 5.5, indices used when referencing chars in fields
> corresponded to bytes - this meant that you had to take into account
> the encoding of the runs in the field. In this case, char N referred
> to the Nth byte in the field. So if your field had:
>
>  abcXYZabc (where XYZ are two byte unicode chars)
>
> Then char 4 would refer to the first byte of the X unicode char and
> *not* the two bytes it would have actually taken up.
>
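
A minimal sketch of that byte layout, not taken from Mark's message. The handler name byteRangesForChars and the "n"/"u" flag string are hypothetical conventions: one flag per 5.5+ char, "n" for a native char (1 byte in 4.6.4) and "u" for a unicode char (2 bytes in 4.6.4). It touches no field, so it only illustrates the arithmetic.

```livecode
-- Hypothetical helper: list the 4.6.4 byte range occupied by each 5.5+ char.
-- pEncodings holds one flag per char: "n" = native (1 byte), "u" = unicode (2 bytes).
function byteRangesForChars pEncodings
   local tByteStart, tWidth, tIndex, tResult
   put 1 into tByteStart
   repeat with tIndex = 1 to the number of chars of pEncodings
      if char tIndex of pEncodings is "u" then
         put 2 into tWidth
      else
         put 1 into tWidth
      end if
      -- One line per char, in the form charIndex:firstByte-lastByte
      put tIndex & ":" & tByteStart & "-" & (tByteStart + tWidth - 1) & return after tResult
      add tWidth to tByteStart
   end repeat
   -- Drop the trailing return.
   delete char -1 of tResult
   return tResult
end byteRangesForChars
```

For abcXYZabc the flags are "nnnuuunnn", and the lines for X, Y and Z come out as 4:4-5, 5:6-7 and 6:8-9. So the pre-5.5 range "char 4 to 9" and the 5.5+ range "char 4 to 6" both cover XYZ.
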
> Now, importantly, the internal structure of the field did not change
> between 4.6.4 and 5.5, just how the 'char' chunk worked - in 6.7.11,
> the internal structure of the field is still the mixed runs of
> unicode/native bytes just as it was in 4.6.4 - the only difference is
> what happens if you reference char X to Y of the field.
>
> So solving this problem comes down to finding a means to 'get at' the
> internal encoding style runs of a field in 6.7.11. We want a handler:
>
>   mapByteRangeToCharRange(pFieldId, pByteFrom, pByteTo)
>
> Returning a pair pCharFrom, pCharTo - where pByteFrom, pByteTo are a
> char X to Y range from 4.6.4 and pCharFrom, pCharTo are a char X to Y
> range *for the same range* in 6.7.11.
>
> -*-
>
> Before going into the details, an easy way to see the internal mixed
> encoding of a field containing unicode in 6.7.11, is to put some text
> which is a mixture of native text and unicode text in a field and then
> look at its 'text' property. Putting:
>
> Лорем ипсум Lorem ipsum dolor sit amet, pr долор сит амет, вел татион
> игнота сцрибентур еи. Вих еа феугиат doctus necessitatibus ассентиор
> пхилосопхиа. Феугаитconsulatu disputando comprehensam  вивендум вис
> ет, мел еррем малорум ат. Хас но видерер лобортис, suscipit detraxit
> interesset eum аппетере инсоленс салутатус усу не. Еи дуо лудус
> яуаеяуе, ет елитр цорпора пер.
>
> Into a 6.7.11 field and then doing 'put the text of field 1' gives:
>
> ????? ????? Lorem ipsum dolor sit amet, pr ????? ??? ????, ??? ??????
> ?????? ?????????? ??. ??? ?? ??????? doctus necessitatibus ?????????
> ???????????. ???????consulatu disputando comprehensam  ???????? ???
> ??, ??? ????? ??????? ??. ??? ?? ??????? ????????, suscipit detraxit
> interesset eum ???????? ???????? ????????? ??? ??. ?? ??? ?????
> ???????, ?? ????? ??????? ???.
>
> Here we see some of how 6.7.11 fields handled unicode. Each '?'
> indicates that the 'char' being fetched at that index is a unicode
> codeunit (i.e. not representable in the native encoding). It is
> easy to see by inspection that the two match up - for each cyrillic
> letter there is a '?', and the roman letters come through directly.
>
> In contrast if I do the same thing with 4.6.4, I get this:
>
> ??>?@?5?<? 8???A?C?<? Lorem ipsum dolor sit amet, pr 4?>?;?>?@? A?8?B?
> 0?<?5?B?, 2?5?;? B?0?B?8?>?=? 8?3?=?>?B?0? A?F?@?8?1?5?=?B?C?@? 5?8?.
> ??8?E? 5?0? D?5?C?3?8?0?B? doctus necessitatibus 0?A?A?5?=?B?8?>?@?
> ??E?8?;?>?A?>???E?8?0?. $?5?C?3?0?8?B?consulatu disputando
> comprehensam  2?8?2?5?=?4?C?<? 2?8?A? 5?B?, <?5?;? 5?@?@?5?<?
> <?0?;?>?@?C?<? 0?B?. %?0?A? =?>? 2?8?4?5?@?5?@? ;?>?1?>?@?B?8?A?,
> suscipit detraxit interesset eum 0?????5?B?5?@?5? 8?=?A?>?;?5?=?A?
> A?0?;?C?B?0?B?C?A? C?A?C? =?5?. ??8? 4?C?>? ;?C?4?C?A? O?C?0?5?O?C?5?,
> 5?B? 5?;?8?B?@? F?>?@???>?@?0? ??5?@?.
>
> In order to make sure this came through vaguely sanely, I've replaced
> all bytes < 32 with ?. If you compare with 6.7.11 output you can see
> that for each '?' present in 'the text' of the 6.7.11 field, there are
> *two* chars in the 4.6.4 output:
>
>     Лорем (original) -> ????? (6.7.11) -> ??>?@?5?<? (4.6.4)
>
> This shows quite clearly the difference between 4.6.4 and 6.7.11 in
> handling text/char ranges - in 6.7.11 whilst internally each unicode
> codeunit takes up two bytes you don't see that, instead you see only a
> single 'char'. In comparison in 4.6.4, all the gory details are laid
> bare - you see the individual bytes making up the unicode codeunits.
>
> -*-
>
> Now, the above is only a rough way to see the internals of the field -
> the ? char in any one place in the text could be an actual '?' or a
> '?' which comes about because there is a non-native codeunit there.
> However, you can tell the encoding of any one char in a field by
> looking at the 'encoding' property of the char.
>
>    put the encoding of char 1 of field 1 -> unicode
>    put the encoding of char 30 of field 1 -> native
>
> We can use this information (in 6.7.11) to implement the required
> handler (which uses an auxiliary handler to map one index):
>
> -- Map a 4.6.4 char (byte) range to a 5.5+ char range.
> function mapByteRangeToCharRange pFieldId, pByteFrom, pByteTo
>    -- Convert the index of the beginning of the range.
>    local tCharFrom
>    put mapByteIndexToCharIndex(pFieldId, pByteFrom) into tCharFrom
>
>    -- Convert the index of the end of the range. We add 1 to the end
>    -- offset here so that we find the index of the char after the end
>    -- char. We need to do this as the byte range of a single unicode
>    -- char is always 2 bytes long.
>    local tCharTo
>    put mapByteIndexToCharIndex(pFieldId, pByteTo + 1) into tCharTo
>
>    -- If the range is a singleton, charFrom and charTo will be the
>    -- same.
>    if tCharFrom is tCharTo then
>       return tCharFrom,tCharTo
>    end if
>
>    -- Otherwise it is a multi-char range, and tCharTo will actually
>    -- be the char after the end of the range (due to the adjustment
>    -- above).
>    return tCharFrom, tCharTo - 1
> end mapByteRangeToCharRange
>
> -- Map a 4.6.4 char (byte) offset to a 5.5+ char offset.
> private function mapByteIndexToCharIndex pFieldId, pByteIndex
>    -- Char indices start from 1
>    local tCharIndex
>    put 1 into tCharIndex
>
>    -- We iterate over the 5.5+ notion of chars until the original 4.6.4
>    -- byte index is exhausted.
>    repeat while pByteIndex > 1
>       -- If the encoding of the char at the 5.5+ index is native, then it
>       -- will have required 1 byte in 4.6.4; otherwise it will have
>       -- required 2 bytes in 4.6.4.
>       if the encoding of char tCharIndex of pFieldId is "native" then
>          subtract 1 from pByteIndex
>       else
>          subtract 2 from pByteIndex
>       end if
>       -- We've consumed a single 5.5+ char, and either 1 or 2 4.6.4
>       -- bytes at this point.
>       add 1 to tCharIndex
>    end repeat
>
>    -- The final char index we computed is the char corresponding to
>    -- the byte index in 4.6.4.
>    return tCharIndex
> end mapByteIndexToCharIndex
>
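
A usage sketch, assuming the two handlers above sit in the same script as the calling handler, and assuming a field named "Document" (a hypothetical name) that holds the abcXYZabc example with XYZ stored as unicode:

```livecode
on mouseUp
   local tRange
   -- Bytes 4 to 9 in 4.6.4 cover the three 2-byte chars XYZ.
   put mapByteRangeToCharRange(the long id of field "Document", 4, 9) into tRange
   -- tRange is a comma-separated pair, so the two ends are items 1 and 2.
   select char (item 1 of tRange) to (item 2 of tRange) of field "Document"
end mouseUp
```

Stepping through the code by hand for that content, the start maps to char 4 and byte 10 (the byte after the end) maps to char 7, so the pair returned is 4,6.
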
> Now, this isn't the most efficient method of doing it - for example,
> you could scan from the start offset to the end offset rather than
> from the beginning again; or use the styledText array of the field
> which gives you the encoding of each style run in the field - this
> would save the by-char lookup. Perhaps an interesting exercise to see
> how fast it can be made?
>
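
As one take on the first suggestion, here is a sketch that walks the field once and picks up both ends of the range in the same pass. The handler name mapByteRangeToCharRangeOnePass is hypothetical. It relies only on the encoding property used above and is meant to return the same pair as mapByteRangeToCharRange for the same arguments. It has not been timed or run against real study files.

```livecode
-- Hypothetical one-pass variant of mapByteRangeToCharRange.
function mapByteRangeToCharRangeOnePass pFieldId, pByteFrom, pByteTo
   local tCharIndex, tByteStart, tCharFrom
   put 1 into tCharIndex      -- the 5.5+ char being looked at
   put 1 into tByteStart      -- the 4.6.4 byte at which that char starts
   put empty into tCharFrom
   repeat while tByteStart <= pByteTo
      -- The first char starting at or after pByteFrom begins the range.
      if tCharFrom is empty and tByteStart >= pByteFrom then
         put tCharIndex into tCharFrom
      end if
      -- Native chars took 1 byte in 4.6.4, unicode chars took 2.
      if the encoding of char tCharIndex of pFieldId is "native" then
         add 1 to tByteStart
      else
         add 2 to tByteStart
      end if
      add 1 to tCharIndex
   end repeat
   -- tCharIndex is now the first char starting after pByteTo.
   if tCharFrom is empty then
      -- No char starts inside the byte range; the original handler
      -- returns the same index twice in this case.
      return tCharIndex, tCharIndex
   end if
   return tCharFrom, tCharIndex - 1
end mapByteRangeToCharRangeOnePass
```

The saving is that chars before pByteFrom are examined once rather than twice. Each step still costs a property lookup on the field.
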
> -*-
>
> So this is the solution for 4.6.4->6.7.11. In 7+ the internal
> structure of the field *did* change, it moved to using a string for
> each paragraph rather than a mixed style-run approach - i.e. the
> internal data structure for each paragraph is either a unicode string
> or a native string (although you can't tell the difference in 7 as
> that's an internal detail). In order for the approach to work in 7.x,
> the 4.6.4 internal structure would need to be recreated from the text
> of the field. This is definitely possible to do - basically the
> approach 4.6.4 used was to convert all chars it could to native,
> leaving the rest as unicode. So:
>
>   xxxXyZwww (uppercase are unicode-only chars, lowercase are
>   can-be-native unicode chars)
>
> Would end up with:
>
>   xxx - native
>   X - unicode
>   y - native
>   Z - unicode
>   www - native
>
> Once split up like this, rather than accessing the encoding property
> of the field you would use the encoding derived by splitting up the
> text content of the field in the above manner.
>
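
A rough sketch of that splitting step for 7+, under several assumptions. It treats each codeunit of the text as one pre-7 char. It decides "can be native" by round-tripping the codeunit through textEncode and textDecode with the "native" encoding. It also assumes the native encoding on the machine running it matches the one 4.6.4 used when the study file was made. Both handler names are hypothetical.

```livecode
-- Hypothetical helper for 7+: one flag per codeunit of pText,
-- "n" if it survives a round trip through the native encoding, "u" if not.
function encodingFlagsOfText pText
   local tFlags, tIndex, tUnit
   set the caseSensitive to true
   repeat with tIndex = 1 to the number of codeunits of pText
      put codeunit tIndex of pText into tUnit
      if textDecode(textEncode(tUnit, "native"), "native") is tUnit then
         put "n" after tFlags
      else
         put "u" after tFlags
      end if
   end repeat
   return tFlags
end encodingFlagsOfText

-- Same walk as mapByteIndexToCharIndex, but driven by the flags
-- instead of the encoding property of the field.
function mapByteIndexUsingFlags pFlags, pByteIndex
   local tCharIndex
   put 1 into tCharIndex
   repeat while pByteIndex > 1
      if char tCharIndex of pFlags is "n" then
         subtract 1 from pByteIndex
      else
         subtract 2 from pByteIndex
      end if
      add 1 to tCharIndex
   end repeat
   return tCharIndex
end mapByteIndexUsingFlags
```

The flags would be built once per document from the text of the field and then reused for every range in the study file. The index returned counts codeunits, so in 7+ it lines up with the codeunit chunk and not necessarily with char.
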
> -*-
>
> Of course, having said that (and testing in 7.0) - the encoding
> property of char ranges in the field should probably return 'unicode'
> for unicode only chars, and native for can-be-native chars. I'd need
> to look into why it doesn't currently - but if it did, I *think* the
> above code would work in 7+ as well as 5.5+. (I've filed
> http://quality.livecode.com/show_bug.cgi?id=20811 so I don't forget to
> have a look!).
>
> Warmest Regards,
>
> Mark.
>