What is LC's internal text format?

Geoff Canyon gcanyon at gmail.com
Tue Nov 13 02:35:58 EST 2018


So then why does put textEncode("a","UTF-32") into X;put chartonum(byte 1
of X) puts 97? That implies that "byte" 1 is "a", not 1100001. Likewise, put
textEncode("㍁","UTF-32") into X;put chartonum(byte 1 of X) puts 65.

I've looked in the dictionary and I don't see anything that comes close to
describing this.
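
For reference, the two byte values above can be reproduced outside LiveCode.
This is a minimal sketch in Python, not LiveCode script. It assumes that
textEncode(..., "UTF-32") produces little-endian bytes with no byte-order
mark, which is consistent with the 97 and 65 reported above. The helper name
first_byte_utf32le is hypothetical.

```python
# Assumption: textEncode(..., "UTF-32") yields little-endian bytes with no BOM.
# Python's "utf-32-le" codec is used here as a stand-in for that behaviour.

def first_byte_utf32le(text):
    """Return the numeric value of the first byte of text encoded as UTF-32LE."""
    return text.encode("utf-32-le")[0]

a_bytes = "a".encode("utf-32-le")        # code point U+0061
sq_bytes = "\u3341".encode("utf-32-le")  # code point U+3341, assumed to be the square character above

print(a_bytes.hex(" "))              # 61 00 00 00
print(sq_bytes.hex(" "))             # 41 33 00 00
print(first_byte_utf32le("a"))       # 97
print(first_byte_utf32le("\u3341"))  # 65
```

Under that assumption, byte 1 is simply the low-order byte of the code point:
0x61 (97, binary 1100001) for "a" and 0x41 (65) for U+3341.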

gc

On Mon, Nov 12, 2018 at 10:21 PM Mark Waddingham via use-livecode <
use-livecode at lists.runrev.com> wrote:

> On 2018-11-13 07:15, Geoff Canyon via use-livecode wrote:
> > On Mon, Nov 12, 2018 at 3:50 PM Monte Goulding via use-livecode <
> > use-livecode at lists.runrev.com> wrote:
> > Unless I'm misunderstanding, this hasn't been my observation. Using
> > offset on a string that has been textEncode()d to UTF-32 returns values
> > that are 4 * (the character offset - 1) + 1 -- if it were re-encoded,
> > wouldn't it return the actual offsets (except when it fails)? Also, 𐀁
> > encodes to 00010001, and routines that convert to UTF-32 and then use
> > offset will find five instances of that character in the UTF-32
> > encoding because of improper boundaries. To see this, run this code:
> >
> > on mouseUp
> >    put textencode("𐀁","UTF-32") into X
> >    put textencode("𐀁𐀁𐀁","UTF-32") into Y
> >    put offset(X,Y,1)
> > end mouseUp
> >
> > That will return 2, meaning that it found the encoding for X starting
> > at character 2 + 1 = 3 of Y. In other words, it found X using the last
> > half of the first "𐀁" and the first half of the second "𐀁".
>
> The textEncode function generates binary data which is composed of
> bytes. When you use binary data in a text function (which offset is),
> the engine uses a compatibility conversion which treats the sequence of
> bytes as a sequence of native characters (this preserves what happened
> pre-7.0 when strings were only ever native, and as such binary and
> string were essentially the same thing).
>
> So if you textEncode a 1 (native) character string as UTF-32, you will
> get a four byte string, which will then turn back into a 4 (native)
> character string when passed to offset.
>
> Warmest Regards,
>
> Mark.
>
> --
> Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
> LiveCode: Everyone can create apps

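The quoted explanation (binary data read as one native character per byte)
accounts for the result of 2 in the mouseUp example. The sketch below is
Python, not LiveCode script, and again assumes little-endian UTF-32 with no
byte-order mark. native_offset is a hypothetical helper that mimics
offset(needle, haystack, skip) over bytes; it is not an engine function.

```python
# Assumption: UTF-32 output is little-endian with no BOM, and the engine reads
# each byte of the binary data as one native character when offset is called.

def native_offset(needle, haystack, skip=0):
    """Mimic offset(needle, haystack, skip) on bytes, one byte per character.

    Returns the 1-based position relative to the skipped characters,
    or 0 when the needle is not found.
    """
    idx = haystack.find(needle, skip)
    return 0 if idx < 0 else idx - skip + 1

x = "\U00010001".encode("utf-32-le")        # 01 00 01 00
y = ("\U00010001" * 3).encode("utf-32-le")  # the same four bytes, three times

print(len(x))                  # 4: one character became four "native characters"
print(native_offset(x, y, 1))  # 2: a match straddling the first two characters

# Every 1-based position where x occurs in y, overlapping matches included.
all_matches = [i + 1 for i in range(len(y) - len(x) + 1) if y[i:i + len(x)] == x]
# Only the matches that start on a 4-byte boundary are real characters.
aligned = [p for p in all_matches if (p - 1) % 4 == 0]

print(all_matches)  # [1, 3, 5, 7, 9]
print(aligned)      # [1, 5, 9]
```

The five raw matches against three aligned ones correspond to the "five
instances" mentioned in the quoted message: any search over the encoded bytes
has to discard hits that do not start at 4 * (n - 1) + 1.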

