What is LC's internal text format?

Ben Rubinstein benr_mc at cogapp.com
Mon Nov 12 17:35:39 EST 2018


This is something that I've been wondering about for a while.

My unexamined assumption had been that in the 'new' fully Unicode LC, text was 
held in UTF-8. However, when I saved some text strings in binary I got 
something like UTF-8, but not quite. The recent experiments with offset 
suggested that LC is at least able to distinguish a string which is fully 
represented as single-byte (or perhaps ASCII?) from one which is not. And the 
reports of the ingenious investigators using UTF-32 to speed up offsets, who 
discovered that offset somehow managed to be case-insensitive in this case, 
made me wonder whether, after textEncode(xt, "UTF-32") is used, LC marks the 
string in some way to give a clue about how to interpret it as text.

So could someone who is familiar with this bit of the engine enlighten us? In 
particular:
- What is the internal format?
- Is it different on different platforms?
- Given that it appears to include a flag to indicate whether it is 
single-byte text or not, are there any other attributes?
- Does saving a string in a 'binary' file faithfully report the internal format?
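
On that last point, one way to check what a saved file actually contains is to compare its bytes against the standard encodings of the same text. What follows is only a minimal sketch in Python rather than LiveCode, and it says nothing about what the engine really does internally. The helper names (matching_encodings, hex_dump) and the list of candidate encodings are my own assumptions, chosen for illustration.

```python
# Hypothetical helpers for illustration only; not part of LiveCode.
# They compare bytes read from a dumped file with the standard
# encodings of the text that was expected to be in it.

# Assumed list of encodings worth checking; extend as needed.
CANDIDATES = [
    "ascii", "latin-1", "utf-8",
    "utf-16-le", "utf-16-be",
    "utf-32-le", "utf-32-be",
]


def matching_encodings(text, raw):
    """Return the candidate encodings under which `text` encodes exactly to `raw`."""
    matches = []
    for name in CANDIDATES:
        try:
            encoded = text.encode(name)
        except UnicodeEncodeError:
            # The text contains characters this encoding cannot represent.
            continue
        if encoded == raw:
            matches.append(name)
    return matches


def hex_dump(raw):
    """Render bytes as space-separated two-digit hex values."""
    return " ".join(f"{b:02x}" for b in raw)


if __name__ == "__main__":
    sample = "caf\u00e9"  # 'cafe' with an acute accent on the final letter
    for name in CANDIDATES:
        try:
            print(f"{name:10s} {hex_dump(sample.encode(name))}")
        except UnicodeEncodeError:
            print(f"{name:10s} (cannot represent this text)")
```

If the bytes read back from the file match none of the candidates, that would be consistent with the file holding something other than a plain standard encoding, but it would not by itself show what the internal format is.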

TIA,

Ben



