What is LC's internal text format?

That's really helpful - and in parts eye-opening - thanks Mark.

I have a few follow-up questions.

Does textEncode _always_ return a binary string? Or, if invoked with "CP1252", 
"ISO-8859-1", "MacRoman" or "Native", does it return a string?

 > CodepointOffset has signature 'integer codepointOffset(string)', so when you
 > pass a binary string (data) value to it, the data value gets converted to a
 > string by interpreting it as a sequence of bytes in the native encoding.

OK - so one message I take is that in fact one should never invoke 
codepointOffset on a binary string. Should it actually throw an error in this 
case?

By the same token, probably one should only use 'byte', 'byteOffset', 
'byteToNum' etc with binary strings - would it be better, to avoid confusion, 
if char, offset, charToNum etc refused to operate on a binary string?

 > e.g. In the case of &, it can either take two data arguments, or two
 > string arguments. In this case, if both arguments are data, then the result
 > will be data. Otherwise both arguments will be converted to strings, and a
 > string returned.

The second message I take is that one needs to be very careful, if operating 
on UTF8 or other binary strings, to avoid 'contaminating' them e.g. by 
concatenating with a simple quoted string, as this may cause it to be silently 
converted to a non-binary string. (I presume that 'put "simple string" 
after/before pBinaryString' will cause a conversion in the same way as "&"? 
What about 'put "!" into char x of pBinaryString'?)
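
A minimal sketch of how one could probe these cases in script, using the 'is strictly' operators Mark mentions below. The handler name probeConcat and the variable names are made up for illustration, and it assumes textEncode hands back data for "UTF-8" (the first report line checks that rather than taking it for granted). It only reports what the engine says; going by the rule quoted above I would expect data & data to stay data and data & literal to become a string.

```livecode
on probeConcat
   local tData, tReport
   -- assumed to be data (binary string); the first report line checks this
   put textEncode("abc", "UTF-8") into tData
   put "encoded is data:" && (tData is strictly a binary string) into tReport
   -- both operands are data
   put return & "data & data is data:" && \
         ((tData & tData) is strictly a binary string) after tReport
   -- one operand is a quoted literal
   put return & "data & literal is data:" && \
         ((tData & "x") is strictly a binary string) after tReport
   -- the 'put ... after' case asked about above
   put "x" after tData
   put return & "after 'put ... after', still data:" && \
         (tData is strictly a binary string) after tReport
   -- with no target, put sends the report to the message box
   put tReport
end probeConcat
```

Calling probeConcat from the message box prints four lines, one per case, so the answers can be compared across engine versions.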

The engine can tell whether a string is 'native' or UTF16. When the engine is 
converting a binary string to 'string', does it always interpret the source as 
the native 8-bit encoding, or does it have some heuristic to decide whether it 
would be more plausible to interpret the source as UTF16?
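
To make the difference between the implicit conversion and an explicit textDecode concrete, here is a small sketch. probeDecode is a hypothetical handler name, and it again assumes textEncode with "UTF-8" returns data. If the implicit conversion really is always native, the char count of the raw bytes should equal the byte count, while the explicitly decoded string should be a single char.

```livecode
on probeDecode
   local tBytes, tReport
   -- e-acute (U+00E9) is two bytes in UTF-8: C3 A9
   put textEncode(numToCodepoint(233), "UTF-8") into tBytes
   put "bytes:" && (the number of bytes in tBytes) into tReport
   -- 'chars' needs a string, so the data is converted implicitly here
   put return & "chars (implicit conversion):" && \
         (the number of chars in tBytes) after tReport
   -- explicit decode: the bytes are interpreted as UTF-8
   put return & "chars (textDecode):" && \
         (the number of chars in textDecode(tBytes, "UTF-8")) after tReport
   -- with no target, put sends the report to the message box
   put tReport
end probeDecode
```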

Thanks again for all the detail!


On 13/11/2018 13:31, Mark Waddingham via use-livecode wrote:
> On 2018-11-13 12:43, Ben Rubinstein via use-livecode wrote:
>> I'm grateful for all the information, but _outraged_ that the thread
>> that I carefully created separate from the offset thread was so
>> quickly hijacked for the continuing (useful!) detailed discussion on
>> that topic.
> The phrase 'attempting to herd cats' springs to mind ;)
>> From recent contributions on both threads I'm getting some more
>> insights, but I'd really like to understand clearly what's going on. I
>> do think that I should have asked this question more broadly: how does
>> the engine represent values internally?
> The engine uses a number of distinct types 'behind the scenes'. The ones
> pertinent to LCS (there are many many more which LCS never sees) are:
>    - nothing: a type with a single value (nothing/null)
>    - boolean: a type with two values true/false
>    - number: a type which can either store a 32-bit integer *or* a double
>    - string: a type which can either store a sequence of native (single byte) 
> codes, or a sequence of unicode (two byte - UTF-16) codes
>    - name: a type which stores a string, but uniques the string so that 
> caseless and exact equality checking is constant time
>    - data: a type which stores a sequence of bytes
>    - array: a type which stores (using a hashtable) a mapping from 'names' to 
> any other storage value type
> The LCS part of the engine then sits on top of these core types, providing
> various conversions depending on context.
> All LCS syntax is actually typed - meaning that when you pass a value to any
> piece of LCS syntax, each argument is converted to the type required.
> e.g. nativeCharToNum() has signature 'integer nativeCharToNum(string)' meaning 
> that it expects a string as input and will return a number as output.
> Some syntax is overloaded - meaning that it can act in slightly different (but 
> always consistent) ways depending on the type of the arguments.
> e.g. & has signatures 'string &(string, string)' and 'data &(data, data)'.
> In simple cases where there is no overload, type conversion occurs exactly as 
> required:
> e.g. In the case of nativeCharToNum() - it has no overload, so always expects 
> a string, which means that the input argument will always undergo a 'convert 
> to string' operation.
> The convert to string operation operates as follows:
>     - nothing -> ""
>     - boolean -> "true" or "false"
>     - number -> decimal representation of the number, using numberFormat
>     - string -> stays the same
>     - name -> uses the string the name contains
>     - data -> converts to a string using the native encoding
>     - array -> converts to empty (a very old semantic which probably does more 
> harm than good!)
> In cases where syntax is overloaded, type conversion generally happens in 
> syntax-specific sequence in order to preserve consistency:
> e.g. In the case of &, it can either take two data arguments, or two string 
> arguments. In this case,
> if both arguments are data, then the result will be data. Otherwise both 
> arguments will be converted
> to strings, and a string returned.
>> From Monte I get that the internal encoding for 'string' may be
>> MacRoman, ISO 8859 (I thought it would be CP1252), or UTF16 -
>> presumably with some attribute to tell the engine which one in each
>> case.
> Monte wasn't quite correct - on Mac it is MacRoman or UTF-16, on Windows it
> is CP1252 or UTF-16, on Linux it is ISO8859-1 or UTF-16. There is an internal
> flag in a string value which says whether its character sequence is 
> single-byte (native) or double-byte (UTF-16).
>> So then my question is whether a 'binary string' is a pure blob, with
>> no clues as to interpretation; or whether in fact it does have some
>> attributes to suggest that it might be interpreted as UTF8, UTF32
>> etc?
> Data (binary string) values are pure blobs - they are sequences of bytes, with
> no knowledge of where they came from. Indeed, storing that would generally be a 
> bad idea as you wouldn't get repeatable semantics (i.e. a value from one 
> codepath which is data might have a different effect in context from one which 
> is fetched from somewhere else).
> That being said, the engine does store some flags on values - but purely for 
> optimization.
> i.e. To save later work. For example, a string value can store its (double) 
> numeric value in
> it - which saves multiple 'convert to number' operations performed on the same 
> (pointer wise) string (due to the copy-on-write nature of values, and the fact 
> that all literals are unique names, pointer-wise equality of values occurs a 
> great deal).
>> If there are no such attributes, how does codepointOffset operate when
>> passed a binary string?
> CodepointOffset has signature 'integer codepointOffset(string)', so when you
> pass a binary string (data) value to it, the data value gets converted to a 
> string by interpreting it as a sequence of bytes in the native encoding.
>> If there are such attributes, how do they get set? Evidently if
>> textEncode is used, the engine knows that the resulting value is the
>> requested encoding. But what happens if the program reads a file as
>> 'binary' - presumably the result is a binary string, how does the
>> engine treat it?
> There are no attributes of that ilk. When you read a file as binary you get 
> data (binary string) values - which means when you pass them to string taking 
> functions/commands that data gets interpreted as a sequence of bytes in the 
> native encoding. This is why you must always explicitly textEncode/textDecode 
> data values when you know they are not representing native encoded text.
>> Is there any way at LiveCode script level to detect what a value is,
>> in the above terms?
> Yes - the 'is strictly' operators:
>    is strictly nothing
>    is strictly a boolean
>    is strictly an integer - a number which has internal rep 32-bit int
>    is strictly a real - a number which has internal rep double
>    is strictly a string
>    is strictly a binary string
>    is strictly an array
> It should be noted that 'is strictly' reports only how that value is stored 
> and not anything based on the value itself. This only really applies to 'an 
> integer' and 'a real' - you can store an integer in a double and all LCS 
> arithmetic operators act on doubles.
> e.g. (1+2) is strictly an integer -> false
>       (1+2) is strictly a real -> true
> In contrast, though, *some* syntax will return numbers which are stored 
> internally as integers:
> e.g. nativeCharToNum("a") is strictly an integer -> true
> I should point out that what 'is strictly' operators return for any given 
> context is not stable in the sense that future engine versions might return 
> different things. e.g. We might optimize arithmetic in the future (if we can 
> figure out a way to do it without performance penalty!) so that things which 
> are definitely integers, are stored as integers (e.g. 1 + 2 in the above).
>> And one more question: if a string, or binary string, is saved in a
>> 'binary' file, are the bytes stored on disk a faithful rendition of
>> the bytes that composed the value in memory, or an interpretation of
>> some kind?
> What happens when you read or write data or string values to a file depends on 
> how you opened the file.
> If you opened the file for binary (whether reading or writing), when you read 
> you will get data; when you write, string values will be converted to data via 
> the native encoding (default rule).
> If you opened the file for text, then the engine will try and determine (using 
> a BOM) the existing text encoding of the file. If it can't determine it (if 
> for example, you are opening a file for write which doesn't exist), it will 
> assume it is encoded as native.
> Otherwise the file will have an explicit encoding associated with it specified 
> by you - reading from it will interpret the bytes in that explicit encoding; 
> while writing to it will expect string values which will be encoded 
> appropriately. In the latter case if you write data values, they will first be 
> converted to a string (assuming native encoding) and then written as strings 
> in the file's encoding (i.e. default type conversion applies).
> Essentially you can view files as typed streams - if you opened for binary 
> then read/write give/take data; if you opened for text then read/write 
> give/take strings and default type conversion rules apply.
> Warmest Regards,
> Mark.
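
Mark's advice on binary reads could be sketched as the handler below. The handler name probeBinaryRead and the parameter pPath are hypothetical, and the file is assumed to hold UTF-8 text; substitute whatever encoding you know the file to be in. It reads the bytes as data, decodes them explicitly rather than leaving it to the implicit native conversion, and reports how each value is stored.

```livecode
-- pPath: hypothetical path to a file assumed to contain UTF-8 text
on probeBinaryRead pPath
   local tRaw, tText
   open file pPath for binary read
   if the result is not empty then
      answer "Could not open file:" && the result
      exit probeBinaryRead
   end if
   -- a binary read delivers data (binary string) in 'it'
   read from file pPath until EOF
   put it into tRaw
   close file pPath
   -- decode explicitly instead of relying on implicit native conversion
   put textDecode(tRaw, "UTF-8") into tText
   put "raw is data:" && (tRaw is strictly a binary string) & return & \
         "decoded is string:" && (tText is strictly a string)
end probeBinaryRead
```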

