What is LC's internal text format?

Mark Waddingham mark at livecode.com
Tue Nov 13 03:03:07 EST 2018


On 2018-11-13 08:35, Geoff Canyon via use-livecode wrote:
> So then why does put textEncode("a","UTF-32") into X;put chartonum(byte 1
> of X) put 97?

Because:

   1) textEncode("a", "UTF-32") produces the byte sequence <97,0,0,0>
   2) byte 1 of <97,0,0,0> is <97>
   3) charToNum(<97>) first converts the byte <97> into a native string 
which is "a" (as the 97 is the code for 'a' in the native encoding 
table), then converts that (native) char to a number -> 97
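
The three steps above can be sketched in script as follows. This is only 
an illustration, assuming LiveCode 7 or later; the variable names tData 
and tResult are made up for the example, and byteToNum is included to 
show the direct byte-to-number route next to the legacy charToNum one.

```livecode
on mouseUp
   local tData, tResult
   -- Step 1: encode the one-character string as UTF-32 binary data
   put textEncode("a", "UTF-32") into tData
   put (the number of bytes in tData) & return after tResult
   -- Steps 2 and 3: charToNum first treats byte 1 as a native char,
   -- then converts that char to a number
   put charToNum(byte 1 of tData) & return after tResult
   -- byteToNum converts the byte to a number with no string step
   put byteToNum(byte 1 of tData) after tResult
   put tResult
end mouseUp
```

If the byte sequence is <97,0,0,0> as described, this should report 4 
bytes, and 97 from both conversions.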

> That implies that "byte" 1 is "a", not 1100001.

1100001 is 97 but printed in base-2.

FWIW, I think you are confusing 'binary string' with 'binary number' - 
these are not the same thing.

A 'binary string' (internally the data type is 'Data') is a sequence of 
bytes (just as a 'string' is a sequence of 
characters/codepoints/codeunits).

A 'binary number' is a number which has been rendered to a string with 
base-2.

Bytes are like characters (and codepoints, and codeunits) in that they 
are 'abstract' things - they aren't numbers, and have no direct 
conversion to them - which is why we have byteToNum, numToByte, 
nativeCharToNum, numToNativeChar, codepointToNum and numToCodepoint.
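
A small sketch of the distinction, assuming LiveCode 7 or later (the 
variable names are illustrative only): baseConvert renders a number as 
base-2 text, whereas numToByte produces a one-byte piece of binary data, 
and each kind of 'abstract' unit has its own explicit conversion pair.

```livecode
on mouseUp
   local tByte, tResult
   -- A 'binary number': the number 97 rendered as a string in base 2
   put baseConvert(97, 10, 2) & return after tResult
   -- A 'binary string': one byte of data whose value is 97
   put numToByte(97) into tByte
   put byteToNum(tByte) & return after tResult
   -- The native char and codepoint pairs are used the same way
   put nativeCharToNum("a") & return after tResult
   put codepointToNum("a") after tResult
   put tResult
end mouseUp
```

The first line of output is the text "1100001"; the remaining three are 
the number 97 arrived at from a byte, a native char and a codepoint 
respectively.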

The charToNum and numToChar functions are actually deprecated / 
considered legacy - as their behavior (when useUnicode is set to true) 
depends on processing unicode text as binary data - which isn't how 
unicode works from LiveCode 7 onwards (indeed, there was no way to fold 
their behavior into the new model - hence the deprecation, and 
replacement with nativeCharToNum / numToNativeChar).

You'll notice that there is no modern 'charToNum'/'numToChar' - just 
'codepointToNum'/'numToCodepoint'. A codepoint is an index into the 
(large - 21-bit) Unicode code table; Unicode characters can be composed 
of multiple codepoints (e.g. [e,combining-acute]) and thus don't have a 
'number' per se.
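
As a hedged illustration of the [e,combining-acute] case, the sketch 
below builds that character from two codepoints (decimal 101 for 'e' 
and 769 for U+0301, the combining acute accent). It assumes LiveCode 7 
or later, where the char chunk is expected to count user-perceived 
characters; the variable names are illustrative only.

```livecode
on mouseUp
   local tChar, tResult
   -- Build "e" followed by U+0301 COMBINING ACUTE ACCENT
   put numToCodepoint(101) & numToCodepoint(769) into tChar
   -- Count the same string as chars and as codepoints
   put (the number of chars in tChar) & return after tResult
   put (the number of codepoints in tChar) & return after tResult
   -- Each codepoint has a number, the composed char as a whole does not
   put codepointToNum(codepoint 1 of tChar) & comma after tResult
   put codepointToNum(codepoint 2 of tChar) after tResult
   put tResult
end mouseUp
```

Under those assumptions the string should count as one char but two 
codepoints, which is why only the codepoints can be mapped to numbers.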

Warmest Regards,

Mark.

> 
> I've looked in the dictionary and I don't see anything that comes close
> to describing this.
> 
> gc
> 
> On Mon, Nov 12, 2018 at 10:21 PM Mark Waddingham via use-livecode <
> use-livecode at lists.runrev.com> wrote:
> 
>> On 2018-11-13 07:15, Geoff Canyon via use-livecode wrote:
>> > On Mon, Nov 12, 2018 at 3:50 PM Monte Goulding via use-livecode <
>> > use-livecode at lists.runrev.com> wrote:
>> > Unless I'm misunderstanding, this hasn't been my observation. Using
>> > offset on a string that has been textEncode()d to UTF-32 returns
>> > values that are 4 * (the character offset - 1) + 1 -- if it were
>> > re-encoded, wouldn't it return the actual offsets (except when it
>> > fails)? Also, 𐀁 encodes to 00010001, and routines that convert to
>> > UTF-32 and then use offset will find five instances of that
>> > character in the UTF-32 encoding because of improper boundaries. To
>> > see this, run this code:
>> >
>> > on mouseUp
>> >    put textencode("𐀁","UTF-32") into X
>> >    put textencode("𐀁𐀁𐀁","UTF-32") into Y
>> >    put offset(X,Y,1)
>> > end mouseUp
>> >
>> > That will return 2, meaning that it found the encoding for X starting
>> > at character 2 + 1 = 3 of Y. In other words, it found X using the last
>> > half of the first "𐀁" and the first half of the second "𐀁"
>> 
>> The textEncode function generates binary data which is composed of
>> bytes. When you use binary data in a text function (which offset is),
>> the engine uses a compatibility conversion which treats the sequence
>> of bytes as a sequence of native characters (this preserves what
>> happened pre-7.0 when strings were only ever native, and as such
>> binary and string were essentially the same thing).
>> 
>> So if you textEncode a 1 (native) character string as UTF-32, you will
>> get a four byte string, which will then turn back into a 4 (native)
>> character string when passed to offset.
>> 
>> Warmest Regards,
>> 
>> Mark.
>> 
>> --
>> Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
>> LiveCode: Everyone can create apps
>> 
>> _______________________________________________
>> use-livecode mailing list
>> use-livecode at lists.runrev.com
>> Please visit this url to subscribe, unsubscribe and manage your
>> subscription preferences:
>> http://lists.runrev.com/mailman/listinfo/use-livecode

-- 
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps



