What is LC's internal text format?
gcanyon at gmail.com
Tue Nov 13 01:15:06 EST 2018
On Mon, Nov 12, 2018 at 3:50 PM Monte Goulding via use-livecode <
use-livecode at lists.runrev.com> wrote:
> Text strings in LiveCode are native encoded (MacRoman or ISO 8859) where
> possible and where you don’t explicitly tell the engine
> For what it’s worth using `offset` is the wrong thing to do if you have
> textEncoded your strings into binary data. You want to use `byteOffset`
> otherwise the engine will convert your data to a string and assume native
> encoding. This is probably why you are getting some case insensitivity.
Unless I'm misunderstanding, this hasn't been my observation. Using offset
on a string that has been textEncode()d to UTF-32 returns values that are
4 * (the character offset - 1) + 1 -- if it were re-encoded, wouldn't it
return the actual offsets (except when it fails)? Also, 𐀁 encodes to
00010001, and routines that convert to UTF-32 and then use offset will find
five instances of that character in the UTF-32 encoding because of improper
boundaries. To see this, run this code:
put textEncode("𐀁","UTF-32") into X
put textEncode("𐀁𐀁𐀁","UTF-32") into Y
put offset(X,Y,1)
That will return 2, meaning that it found the encoding for X starting at
character 2 + 1 = 3 of Y. In other words, it found X using the last half of
the first "𐀁" and the first half of the second "𐀁".
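
The same boundary problem can be reproduced outside LiveCode. Here is a minimal sketch in Python, assuming big-endian UTF-32 with no byte-order mark (which matches the 00010001 quoted above); find_all and find_aligned are hypothetical helper names for illustration, not anything the engine provides:

```python
def find_all(needle: bytes, haystack: bytes) -> list[int]:
    """Return the 1-based byte offset of every match, overlapping ones included."""
    hits = []
    start = 0
    while True:
        i = haystack.find(needle, start)
        if i == -1:
            return hits
        hits.append(i + 1)   # convert to 1-based, as offset reports
        start = i + 1        # step one byte so overlapping matches are seen


def find_aligned(needle: bytes, haystack: bytes, width: int = 4) -> list[int]:
    """Keep only matches that start on a code-unit boundary (every 4 bytes for UTF-32)."""
    return [pos for pos in find_all(needle, haystack) if (pos - 1) % width == 0]


x = "\U00010001".encode("utf-32-be")        # bytes 00 01 00 01
y = ("\U00010001" * 3).encode("utf-32-be")  # the same four bytes, three times

print(x.hex())             # 00010001
print(find_all(x, y))      # [1, 3, 5, 7, 9]
print(find_aligned(x, y))  # [1, 5, 9]
```

The unaligned search reports five hits, two of which (bytes 3 and 7) straddle two characters. Keeping only hits where (offset - 1) is a multiple of 4 leaves 1, 5 and 9, which are exactly 4 * (the character offset - 1) + 1 for characters 1, 2 and 3.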