What is LC's internal text format?

Mark Waddingham mark at livecode.com
Tue Nov 13 01:21:35 EST 2018


On 2018-11-13 07:15, Geoff Canyon via use-livecode wrote:
> On Mon, Nov 12, 2018 at 3:50 PM Monte Goulding via use-livecode <
> use-livecode at lists.runrev.com> wrote:
> Unless I'm misunderstanding, this hasn't been my observation. Using offset
> on a string that has been textEncode()ed to UTF-32 returns values that are
> 4 * (the character offset - 1) + 1 -- if it were re-encoded, wouldn't it
> return the actual offsets (except when it fails)? Also, 𐀁 encodes to
> 00010001, and routines that convert to UTF-32 and then use offset will find
> five instances of that character in the UTF-32 encoding because of improper
> boundaries. To see this, run this code:
> 
> on mouseUp
>    put textencode("𐀁","UTF-32") into X
>    put textencode("𐀁𐀁𐀁","UTF-32") into Y
>    put offset(X,Y,1)
> end mouseUp
> 
> That will return 2, meaning that it found the encoding for X starting at
> character 2 + 1 = 3 of Y. In other words, it found X using the last half of
> the first "𐀁" and the first half of the second "𐀁".

The textEncode function generates binary data, which is composed of 
bytes. When you use binary data in a text function (which offset is), 
the engine applies a compatibility conversion that treats the sequence 
of bytes as a sequence of native characters, one character per byte. 
This preserves the pre-7.0 behaviour, when strings were only ever native 
and binary data and strings were therefore essentially the same thing.

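So if you textEncode a one-character (native) string as UTF-32, you get 
a four-byte binary string, which then turns back into a four-character 
(native) string when passed to offset. That is why the offsets you see 
are 4 * (the character offset - 1) + 1, and why a match can start in the 
middle of one encoded character and end in the middle of the next.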
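
For anyone who wants to see the arithmetic outside LiveCode, here is a 
rough Python model of the conversion described above. It is only a 
sketch, not the engine's implementation: the helper names are made up 
for illustration, Latin-1 stands in for the native character set, and 
little-endian UTF-32 without a BOM is an assumption about the platform.

```python
# Rough model of the byte-to-native-character conversion; not LiveCode itself.

def text_encode_utf32(s: str) -> bytes:
    # Assumption: little-endian UTF-32 with no byte order mark.
    return s.encode("utf-32-le")

def as_native_chars(data: bytes) -> str:
    # One byte becomes one character (Latin-1 stands in for the native set).
    return data.decode("latin-1")

def offset(needle: str, haystack: str, skip: int = 0) -> int:
    # 1-based position counted from the end of the skipped characters,
    # or 0 when the needle is not found.
    pos = haystack.find(needle, skip)
    return 0 if pos < 0 else pos - skip + 1

# The script from the quoted message: one character, then three of them.
x = as_native_chars(text_encode_utf32("\U00010001"))
y = as_native_chars(text_encode_utf32("\U00010001" * 3))

print(len(x))           # 4: one encoded character becomes four native characters
print(offset(x, y, 1))  # 2: the match starts at character 3 of y, mid-character

# The 4 * (n - 1) + 1 pattern: "b" is character 2 of "ab".
a_b = as_native_chars(text_encode_utf32("ab"))
b = as_native_chars(text_encode_utf32("b"))
print(offset(b, a_b))   # 5
```

In this model the result of 2 does not depend on the byte order, since 
the bytes 00 01 00 01 and 01 00 01 00 both repeat with a period of two.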

Warmest Regards,

Mark.

-- 
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps



