What is LC's internal text format?
Ben Rubinstein
benr_mc at cogapp.com
Mon Nov 12 17:35:39 EST 2018
This is something that I've been wondering about for a while.
My unexamined assumption had been that in the 'new' fully Unicode LC, text was
held in UTF-8. However when I saved some text strings in binary I got
something like UTF-8 - but not quite. And the recent experiments with offset
suggested that LC is at least able to distinguish a string which can be
fully represented as single-byte (or perhaps ASCII?) from one which cannot. And
the reports of the ingenious investigators using UTF-32 to speed up offsets,
and discovering that offset somehow managed to be case-insensitive in this
case, made me wonder whether, after using textEncode(xt, "UTF-32"), LC marks
the string in some way to give a clue about how to interpret it as text.
So could someone who is familiar with this bit of the engine enlighten us? In
particular:
- What is the internal format?
- Is it different on different platforms?
- Given that it appears to include a flag to indicate whether it is
single-byte text or not, are there any other attributes?
- Does saving a string in a 'binary' file faithfully report the internal format?
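
For anyone wanting to repeat the binary-save experiment, one way to see which
standard encoding (if any) a dumped file matches is to encode the same text in
each candidate and compare the bytes. The sketch below is in Python rather than
LiveCode script, purely for illustration. The function name and the list of
candidate encodings are my own assumptions, and it says nothing about what the
engine actually stores. An empty result would simply mean the dump is in none
of the listed encodings.

```python
# Candidate encodings to compare against bytes dumped from a file.
# This list is an assumption for illustration, not a list of formats LC uses.
CANDIDATES = [
    "ascii", "latin-1", "mac-roman", "utf-8",
    "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be",
]


def encodings_matching(text, dumped):
    """Return the candidate encodings under which `text` encodes to exactly `dumped`."""
    matches = []
    for name in CANDIDATES:
        try:
            encoded = text.encode(name)
        except UnicodeEncodeError:
            continue  # this encoding cannot represent the text at all
        if encoded == dumped:
            matches.append(name)
    return matches


if __name__ == "__main__":
    # Pure-ASCII text is ambiguous: several encodings give identical bytes.
    print(encodings_matching("abc", b"abc"))
    # A single accented character is enough to tell the candidates apart.
    print(encodings_matching("caf\u00e9", b"caf\xc3\xa9"))
```

Testing with text containing at least one non-ASCII character matters here,
since plain ASCII text produces the same bytes in several of these encodings.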
TIA,
Ben