Mapping Field Text Ranges (was Re: Interprocess Communication (IPC) under OSX)
Mark Waddingham
mark at livecode.com
Thu Dec 28 12:45:35 EST 2017
On 2017-12-19 19:43, Mark Waddingham via use-livecode wrote:
> I'm pretty sure it would be possible to write a handler which takes
> the styledText array of a field in 6.7.11 and a list of old indices,
> returning a list of new char indices... Would that help?
Paul expressed an interest in how this might work - and he provided some
more background:
-*-
Our main application, HyperRESEARCH, a tool for academics and others
doing qualitative research, relies completely on chunk ranges. It is
essentially a bookmarking tool where users can select some content from
a document, the character position (chunk) is grabbed and the user gives
it a text label and HyperRESEARCH remembers that label "Early Childhood
Behavior X" points to char S to T of document "ABC". All documents,
native text, unicode (utf8 or utf16), rtf, docx, odt, etc. are read into
a LiveCode field, from which the selection is made and the chunk
obtained. HyperRESEARCH saves a "Study" file that contains a LOT of
these labels, chunks, and document names.
As part of our migration from LC464, which is what the current release
of HyperRESEARCH is based on, we needed a way to convert a character
range created under LC4.6.4 to a range under LC6.7.11 that points to the
exact same string for the same file. Curry Kenworthy, whose libraries we
license for reading MS-Word and Open Office documents, built a library
based on an algorithm I came up with to send the original LC464 ranges
to a helper application using sockets or IPC. The helper application
retrieves the strings associated with each chunk, strips white space and
passes the string back to the LC6.7.11 version of the main app, which
then finds the whitespace stripped strings in the same file loaded under
LC6.7.11 with an indexing mechanism to adjust the positions for the
stripped whitespace. It is a bit complicated, but it works reliably.
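As an aside, the whitespace-stripping trick Paul describes can be sketched in a few lines of Python. This is purely an illustration of the idea, not Curry's library - the function and variable names are made up here:

```python
def find_stripped(haystack, needle):
    """Locate needle in haystack ignoring whitespace, returning the
    1-based (from, to) char positions of the match in the original
    text - 1-based to match LiveCode chunk expressions."""
    # Map each non-whitespace char of haystack back to its original index.
    index_map = [i for i, ch in enumerate(haystack) if not ch.isspace()]
    stripped_haystack = "".join(ch for ch in haystack if not ch.isspace())
    stripped_needle = "".join(ch for ch in needle if not ch.isspace())

    pos = stripped_haystack.find(stripped_needle)
    if pos < 0:
        return None

    # Convert the match back to positions in the unstripped text.
    return (index_map[pos] + 1,
            index_map[pos + len(stripped_needle) - 1] + 1)

print(find_stripped("ab c d", "bcd"))  # (2, 6)
```

Matching on the stripped strings and then mapping back through the index table is what makes the approach insensitive to whitespace differences between the two loads of the document.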
-*-
From this I infer the following:
1) The study file is a list of triples - label, char chunk, document
filename
2) When using the study file, the original document is loaded into a
field, and the char chunks are used to display labels which the user can
jump to.
3) The char chunks are old-style (pre-5.5) byte indices, not codeunit
indices
The crux of the problem Paul is having comes down to (3) which has some
background to explain.
Before 7.0, the field was the only part of the engine which naturally
handled Unicode. In these older versions the field stored text as a
mixed sequence of style runs of either single bytes (native text) or
double bytes (unicode text).
Between 5.5 and 7.0, indices used when referencing chars in fields
corresponded to codeunits - this meant that the indices were
independent of the encoding of the runs in the field. In this case char
N referred to the Nth codeunit in the field, whether the text up until
that point was all unicode, all native or a mixture of both.
Before 5.5, indices used when referencing chars in fields corresponded
to bytes - this meant that you had to take into account the encoding of
the runs in the field. In this case, char N referred to the Nth byte in
the field. So if your field had:
abcXYZabc (where XYZ are two byte unicode chars)
Then char 4 would refer to the first byte of the X unicode char and
*not* the two bytes it would have actually taken up.
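To make the pre-5.5 behaviour concrete, here is a small Python sketch (my illustration, not engine code) which computes the byte offset at which each char starts under the old mixed-run storage, with Latin-1 standing in for the engine's native encoding:

```python
def native_byte_length(ch):
    # 1 byte if representable in the single-byte native encoding
    # (Latin-1 stands in for it here), otherwise 2 bytes - one
    # UTF-16 codeunit - for unicode text.
    try:
        ch.encode("latin-1")
        return 1
    except UnicodeEncodeError:
        return 2

def byte_offsets(text):
    # 1-based byte offset at which each char starts in the old
    # mixed-run storage.
    offsets = []
    offset = 1
    for ch in text:
        offsets.append(offset)
        offset += native_byte_length(ch)
    return offsets

# "abcXYZabc" with XYZ as genuinely two-byte unicode chars:
print(byte_offsets("abcЖЯЩabc"))  # [1, 2, 3, 4, 6, 8, 10, 11, 12]
```

So in a pre-5.5 field containing this text, char 4 addresses byte 4 - only the first byte of Ж - and char 5 lands mid-character at byte 5.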
Now, importantly, the internal structure of the field did not change
between 4.6.4 and 5.5, just how the 'char' chunk worked - in 6.7.11, the
internal structure of the field is still the mixed runs of
unicode/native bytes just as it was in 4.6.4 - the only difference is
what happens if you reference char X to Y of the field.
So solving this problem comes down to finding a means to 'get at' the
internal encoding style runs of a field in 6.7.11. We want a handler:
mapByteRangeToCharRange(pFieldId, pByteFrom, pByteTo)
Returning a pair pCharFrom, pCharTo - where pByteFrom, pByteTo are a
char X to Y range from 4.6.4 and pCharFrom, pCharTo are a char X to Y
range *for the same range* in 6.7.11.
-*-
Before going into the details, an easy way to see the internal mixed
encoding of a field containing unicode in 6.7.11, is to put some text
which is a mixture of native text and unicode text in a field and then
look at its 'text' property. Putting:
Лорем ипсум Lorem ipsum dolor sit amet, pr долор сит амет, вел татион
игнота сцрибентур еи. Вих еа феугиат doctus necessitatibus ассентиор
пхилосопхиа. Феугаитconsulatu disputando comprehensam вивендум вис ет,
мел еррем малорум ат. Хас но видерер лобортис, suscipit detraxit
interesset eum аппетере инсоленс салутатус усу не. Еи дуо лудус яуаеяуе,
ет елитр цорпора пер.
Into a 6.7.11 field and then doing 'put the text of field 1' gives:
????? ????? Lorem ipsum dolor sit amet, pr ????? ??? ????, ??? ??????
?????? ?????????? ??. ??? ?? ??????? doctus necessitatibus ?????????
???????????. ???????consulatu disputando comprehensam ???????? ??? ??,
??? ????? ??????? ??. ??? ?? ??????? ????????, suscipit detraxit
interesset eum ???????? ???????? ????????? ??? ??. ?? ??? ????? ???????,
?? ????? ??????? ???.
Here we see some of how 6.7.11 fields handled unicode. Each '?' indicates
that the 'char' being fetched at that index is a unicode codeunit (i.e.
not representable in the native encoding). It is relatively easy to see
by inspection that these match up quite easily - for each cyrillic
letter there is a '?', and the roman letters come through directly.
In contrast if I do the same thing with 4.6.4, I get this:
??>?@?5?<? 8???A?C?<? Lorem ipsum dolor sit amet, pr 4?>?;?>?@? A?8?B?
0?<?5?B?, 2?5?;? B?0?B?8?>?=? 8?3?=?>?B?0? A?F?@?8?1?5?=?B?C?@? 5?8?.
??8?E? 5?0? D?5?C?3?8?0?B? doctus necessitatibus 0?A?A?5?=?B?8?>?@?
??E?8?;?>?A?>???E?8?0?. $?5?C?3?0?8?B?consulatu disputando comprehensam
2?8?2?5?=?4?C?<? 2?8?A? 5?B?, <?5?;? 5?@?@?5?<? <?0?;?>?@?C?<? 0?B?.
%?0?A? =?>? 2?8?4?5?@?5?@? ;?>?1?>?@?B?8?A?, suscipit detraxit
interesset eum 0?????5?B?5?@?5? 8?=?A?>?;?5?=?A? A?0?;?C?B?0?B?C?A?
C?A?C? =?5?. ??8? 4?C?>? ;?C?4?C?A? O?C?0?5?O?C?5?, 5?B? 5?;?8?B?@?
F?>?@???>?@?0? ??5?@?.
In order to make sure this came through vaguely sanely, I've replaced
all bytes < 32 with ?. If you compare with 6.7.11 output you can see
that for each '?' present in 'the text' of the 6.7.11 field, there are
*two* chars in the 4.6.4 output:
Лорем (original) -> ????? (6.7.11) -> ??>?@?5?<? (4.6.4)
This shows quite clearly the difference between 4.6.4 and 6.7.11 in
handling text/char ranges - in 6.7.11 whilst internally each unicode
codeunit takes up two bytes you don't see that, instead you see only a
single 'char'. In comparison in 4.6.4, all the gory details are laid
bare - you see the individual bytes making up the unicode codeunits.
-*-
Now, the above is only a rough way to see the internals of the field -
the ? char in any one place in the text could be an actual '?' or a '?'
which comes about because there is a non-native codeunit there. However,
you can tell the encoding of any one char in a field by looking at the
'encoding' property of the char.
put the encoding of char 1 of field 1 -> unicode
put the encoding of char 30 of field 1 -> native
We can use this information (in 6.7.11) to implement the required
handler (which uses an auxiliary handler to map one index):
-- Map a 4.6.4 char (byte) range to a 5.5+ char range.
function mapByteRangeToCharRange pFieldId, pByteFrom, pByteTo
   -- Convert the index of the beginning of the range.
   local tCharFrom
   put mapByteIndexToCharIndex(pFieldId, pByteFrom) into tCharFrom

   -- Convert the index of the end of the range. We add 1 to the end
   -- offset here so that we find the index of the char after the end
   -- char. We need to do this as the byte range of a single unicode
   -- char is always 2 bytes long.
   local tCharTo
   put mapByteIndexToCharIndex(pFieldId, pByteTo + 1) into tCharTo

   -- If the range is a singleton, charFrom and charTo will be the
   -- same.
   if tCharFrom is tCharTo then
      return tCharFrom,tCharTo
   end if

   -- Otherwise it is a multi-char range, and tCharTo will actually
   -- be the char after the end of the range (due to the adjustment
   -- above).
   return tCharFrom, tCharTo - 1
end mapByteRangeToCharRange
-- Map a 4.6.4 char (byte) offset to a 5.5+ char offset.
private function mapByteIndexToCharIndex pFieldId, pByteIndex
   -- Char indices start from 1
   local tCharIndex
   put 1 into tCharIndex

   -- We iterate over the 5.5+ notion of chars until the original 4.6.4
   -- byte index is exhausted.
   repeat while pByteIndex > 1
      -- If the encoding of the char at the 5.5+ index is native, then
      -- it will have required 1 byte in 4.6.4; otherwise it will have
      -- required 2 bytes in 4.6.4.
      if the encoding of char tCharIndex of pFieldId is "native" then
         subtract 1 from pByteIndex
      else
         subtract 2 from pByteIndex
      end if

      -- We've consumed a single 5.5+ char, and either 1 or 2 4.6.4
      -- bytes at this point.
      add 1 to tCharIndex
   end repeat

   -- The final char index we computed is the char corresponding to
   -- the byte index in 4.6.4.
   return tCharIndex
end mapByteIndexToCharIndex
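For anyone who wants to experiment outside LiveCode, the same algorithm ports to Python almost line for line. Here the per-char encodings are passed in as a list - an assumption of this sketch; in the engine you would query the encoding property instead:

```python
def map_byte_index_to_char_index(encodings, byte_index):
    # encodings[i] is "native" or "unicode" for char i + 1 of the field.
    char_index = 1
    # Consume 4.6.4 bytes until the original byte index is exhausted.
    while byte_index > 1:
        if encodings[char_index - 1] == "native":
            byte_index -= 1  # native chars took 1 byte in 4.6.4
        else:
            byte_index -= 2  # unicode chars took 2 bytes in 4.6.4
        char_index += 1
    return char_index

def map_byte_range_to_char_range(encodings, byte_from, byte_to):
    char_from = map_byte_index_to_char_index(encodings, byte_from)
    # Map the index of the char *after* the end of the range, since a
    # single unicode char spans 2 bytes.
    char_to = map_byte_index_to_char_index(encodings, byte_to + 1)
    if char_from == char_to:
        return char_from, char_to
    return char_from, char_to - 1

# "abcXYZabc" where XYZ are unicode chars (2 bytes each in 4.6.4):
enc = ["native"] * 3 + ["unicode"] * 3 + ["native"] * 3
print(map_byte_range_to_char_range(enc, 4, 5))   # (4, 4): the X char
print(map_byte_range_to_char_range(enc, 1, 12))  # (1, 9): whole text
```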
Now, this isn't the most efficient method of doing it - for example, you
could scan from the start offset to the end offset rather than from the
beginning again; or use the styledText array of the field which gives
you the encoding of each style run in the field - this would save the
by-char lookup. Perhaps an interesting exercise to see how fast it can
be made?
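For instance, a run-based variant of the index mapping might look like this in Python, with the runs given as (charCount, encoding) pairs such as could be derived from the styledText array (the representation here is my own; check the actual array layout in the dictionary):

```python
def map_byte_index_via_runs(runs, byte_index):
    # runs: list of (char_count, encoding) pairs describing the field's
    # style runs in order; encoding is "native" or "unicode".
    char_index = 1
    remaining = byte_index - 1  # 4.6.4 bytes still to consume
    for char_count, encoding in runs:
        width = 1 if encoding == "native" else 2
        run_bytes = char_count * width
        if remaining < run_bytes:
            # The target lands inside this run; divide the leftover
            # bytes by the per-char width to finish in O(1).
            return char_index + remaining // width
        remaining -= run_bytes
        char_index += char_count
    return char_index

# The same "abcXYZabc" example as three runs:
runs = [(3, "native"), (3, "unicode"), (3, "native")]
print(map_byte_index_via_runs(runs, 4))   # 4 (first byte of X)
print(map_byte_index_via_runs(runs, 10))  # 7 (the char after Z)
```

This walks run headers rather than individual chars, so the cost is proportional to the number of runs rather than the length of the text.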
-*-
So this is the solution for 4.6.4->6.7.11. In 7+ the internal structure
of the field *did* change: it moved to using a string for each paragraph
rather than a mixed style-run approach - i.e. the internal data
structure for each paragraph is either a unicode string or a native
string (although you can't tell the difference in 7 as that's an
internal detail). In order for the approach to work in 7.x, the 4.6.4
internal structure would need to be recreated from the text of the
field. This is definitely possible to do - basically the approach 4.6.4
used was to convert all chars it could to native, leaving the rest as
unicode. So:
xxxXyZwww (uppercase are unicode only chars, lowercase are
can-be-native unicode chars)
Would end up with:
xxx - native
X - unicode
y - native
Z - unicode
www - native
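Sketching that splitting step in Python (again with Latin-1 standing in for the engine's native encoding - the real can-be-native test is an internal engine detail):

```python
from itertools import groupby

def run_encoding(ch):
    # Can this char be stored as a single native byte?
    try:
        ch.encode("latin-1")
        return "native"
    except UnicodeEncodeError:
        return "unicode"

def split_into_runs(text):
    # Group consecutive chars of the same encoding into runs, mirroring
    # how 4.6.4 stored field text.
    return [("".join(chars), encoding)
            for encoding, chars in groupby(text, key=run_encoding)]

# Using Ж and Щ as the unicode-only X and Z of the example:
print(split_into_runs("xxxЖyЩwww"))
# [('xxx', 'native'), ('Ж', 'unicode'), ('y', 'native'),
#  ('Щ', 'unicode'), ('www', 'native')]
```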
Once split up like this, rather than accessing the encoding property of
the field, you would use the encoding derived by splitting up the text
content of the field in the above manner.
-*-
Of course, having said that (and testing in 7.0) - the encoding property
of char ranges in the field should probably return 'unicode' for
unicode-only chars and 'native' for can-be-native chars. I'd need to look into
why it doesn't currently - but if it did, I *think* the above code would
work in 7+ as well as 5.5+. (I've filed
http://quality.livecode.com/show_bug.cgi?id=20811 so I don't forget to
have a look!).
Warmest Regards,
Mark.
--
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps