What is LC's internal text format?

Geoff Canyon gcanyon at gmail.com
Tue Nov 13 05:06:48 EST 2018


I don't *think* I'm confusing binary string/data with binary numbers -- I
was just trying to illustrate that when a Latin Small Letter A (U+0061)
gets encoded, somewhere there is stored (four bytes, one of which is) a
byte 97, i.e. the bit sequence 1100001, unless computers don't work that
way anymore.
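
A minimal sketch of that idea, assuming a handler or the multi-line message
box: it lists the numeric value of each byte of the encoded data. The
variable names tData, tOut and i are placeholders. Going by Mark's
description quoted below, the list should start with 97 followed by three
zero bytes, though the byte order may depend on the platform:

put textEncode("a","UTF-32") into tData
put empty into tOut
repeat with i = 1 to the number of bytes in tData
   -- byteToNum makes the byte-to-number conversion explicit
   put byteToNum(byte i of tData) & space after tOut
end repeat
put tOut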

What I now see is tripping me up is the implicit byte-to-character cast
you're saying charToNum performs, without a corresponding byte-to-number
cast in numToChar -- i.e. this fails:

put textEncode("a","UTF-32") into X;put numtochar(byte 1 of X)

while this works:

put textEncode("a","UTF-32") into X;put numtochar(bytetonum(byte 1 of X))
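
Since numToChar is described below as deprecated, the same explicit
conversion can presumably be written with the replacement functions Mark
lists. This is only a sketch: tData and tNum are placeholder names, and it
assumes byte 1 holds the value 97 as in Mark's explanation:

put textEncode("a","UTF-32") into tData
-- byte -> number, made explicit rather than relying on an implicit cast
put byteToNum(byte 1 of tData) into tNum
-- number -> Unicode codepoint
put numToCodepoint(tNum)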

Thanks for the insight,

Geoff

On Tue, Nov 13, 2018 at 12:03 AM Mark Waddingham via use-livecode <
use-livecode at lists.runrev.com> wrote:

> On 2018-11-13 08:35, Geoff Canyon via use-livecode wrote:
> > So then why does put textEncode("a","UTF-32") into X;put chartonum(byte 1
> > of X) put 97?
>
> Because:
>
>    1) textEncode("a", "UTF-32") produces the byte sequence <97,0,0,0>
>    2) byte 1 of <97,0,0,0> is <97>
>    3) charToNum(<97>) first converts the byte <97> into a native string
> which is "a" (as the 97 is the code for 'a' in the native encoding
> table), then converts that (native) char to a number -> 97
>
> > That implies that "byte" 1 is "a", not 1100001.
>
> 1100001 is 97 but printed in base-2.
>
> FWIW, I think you are confusing 'binary string' with 'binary number' -
> these are not the same thing.
>
> A 'binary string' (internally the data type is 'Data') is a sequence of
> bytes (just as a 'string' is a sequence of
> characters/codepoints/codeunits).
>
> A 'binary number' is a number which has been rendered to a string with
> base-2.
>
> Bytes are like characters (and codepoints, and codeunits) in that they
> are 'abstract' things - they aren't numbers, and have no direct
> conversion to them - which is why we have byteToNum, numToByte,
> nativeCharToNum, numToNativeChar, codepointToNum and numToCodepoint.
>
> The charToNum and numToChar functions are actually deprecated /
> considered legacy - as their function (when useUnicode is set to true)
> depends on processing unicode text as binary data - which isn't how
> unicode works post-7 (indeed, there was no way to fold their behavior
> into the new model - hence the deprecation, and replacement with
> nativeCharToNum / numToNativeChar).
>
> You'll notice that there is no modern 'charToNum'/'numToChar' - just
> 'codepointToNum'/'numToCodepoint'. A codepoint is an index into the
> (large - 21-bit) Unicode code table; Unicode characters can be composed
> of multiple codepoints (e.g. [e,combining-acute]) and thus don't have a
> 'number' per-se.
>
> Warmest Regards,
>
> Mark.
>
> >
> > I've looked in the dictionary and I don't see anything that comes close to
> > describing this.
> >
> > gc
> >
> > On Mon, Nov 12, 2018 at 10:21 PM Mark Waddingham via use-livecode <
> > use-livecode at lists.runrev.com> wrote:
> >
> >> On 2018-11-13 07:15, Geoff Canyon via use-livecode wrote:
> >> > On Mon, Nov 12, 2018 at 3:50 PM Monte Goulding via use-livecode <
> >> > use-livecode at lists.runrev.com> wrote:
> >> > Unless I'm misunderstanding, this hasn't been my observation. Using offset
> >> > on a string that has been textEncode()ed to UTF-32 returns values that are
> >> > 4 * (the character offset - 1) + 1 -- if it were re-encoded, wouldn't it
> >> > return the actual offsets (except when it fails)? Also, 𐀁 encodes to
> >> > 00010001, and routines that convert to UTF-32 and then use offset will find
> >> > five instances of that character in the UTF-32 encoding because of improper
> >> > boundaries. To see this, run this code:
> >> >
> >> > on mouseUp
> >> >    put textencode("𐀁","UTF-32") into X
> >> >    put textencode("𐀁𐀁𐀁","UTF-32") into Y
> >> >    put offset(X,Y,1)
> >> > end mouseUp
> >> >
> >> > That will return 2, meaning that it found the encoding for X starting at
> >> > character 2 + 1 = 3 of Y. In other words, it found X using the last half of
> >> > the first "𐀁" and the first half of the second "𐀁".
> >>
> >> The textEncode function generates binary data which is composed of
> >> bytes. When you use binary data in a text function (which offset is),
> >> the engine uses a compatibility conversion which treats the sequence of
> >> bytes as a sequence of native characters (this preserves what happened
> >> pre-7.0 when strings were only ever native, and as such binary and
> >> string were essentially the same thing).
> >>
> >> So if you textEncode a 1 (native) character string as UTF-32, you will
> >> get a four byte string, which will then turn back into a 4 (native)
> >> character string when passed to offset.
> >>
> >> Warmest Regards,
> >>
> >> Mark.
> >>
> >> --
> >> Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
> >> LiveCode: Everyone can create apps
> >>
> >> _______________________________________________
> >> use-livecode mailing list
> >> use-livecode at lists.runrev.com
> >> Please visit this url to subscribe, unsubscribe and manage your
> >> subscription preferences:
> >> http://lists.runrev.com/mailman/listinfo/use-livecode
>
> --
> Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
> LiveCode: Everyone can create apps


