What is LC's internal text format?
benr at cogapp.com
Tue Nov 13 14:33:56 EST 2018
That's really helpful - and in parts eye-opening - thanks Mark.
I have a few follow-up questions.
Does textEncode _always_ return a binary string? Or, if invoked with "CP1252",
"ISO-8859-1", "MacRoman" or "Native", does it return a string?
> CodepointOffset has signature 'integer codepointOffset(string)', so when you
> pass a binary string (data) value to it, the data value gets converted to a
> string by interpreting it as a sequence of bytes in the native encoding.
OK - so one message I take is that in fact one should never invoke
codepointOffset on a binary string. Should it actually throw an error in this
case?
By the same token, probably one should only use 'byte', 'byteOffset',
'byteToNum' etc with binary strings - would it be better, to avoid confusion,
if char, offset, charToNum should refuse to operate on a binary string?
> e.g. In the case of &, it can either take two data arguments, or two
> string arguments. In this case, if both arguments are data, then the result
> will be data. Otherwise both arguments will be converted to strings, and a
> string returned.
The second message I take is that one needs to be very careful, if operating
on UTF8 or other binary strings, to avoid 'contaminating' them e.g. by
concatenating with a simple quoted string, as this may cause it to be silently
converted to a non-binary string. (I presume that 'put "simple string"
after/before pBinaryString' will cause a conversion in the same way as "&"?
What about 'put "!" into char x of pBinaryString'?)
The engine can tell whether a string is 'native' or UTF16. When the engine is
converting a binary string to 'string', does it always interpret the source as
the native 8-bit encoding, or does it have some heuristic to decide whether it
would be more plausible to interpret the source as UTF16?
Thanks again for all the detail!
On 13/11/2018 13:31, Mark Waddingham via use-livecode wrote:
> On 2018-11-13 12:43, Ben Rubinstein via use-livecode wrote:
>> I'm grateful for all the information, but _outraged_ that the thread
>> that I carefully created separate from the offset thread was so
>> quickly hijacked for the continuing (useful!) detailed discussion on
>> that topic.
> The phrase 'attempting to herd cats' springs to mind ;)
>> From recent contributions on both threads I'm getting some more
>> insights, but I'd really like to understand clearly what's going on. I
>> do think that I should have asked this question more broadly: how does
>> the engine represent values internally?
> The engine uses a number of distinct types 'behind the scenes'. The ones
> pertinent to LCS (there are many many more which LCS never sees) are:
> - nothing: a type with a single value (nothing/null)
> - boolean: a type with two values true/false
> - number: a type which can either store a 32-bit integer *or* a double
> - string: a type which can either store a sequence of native (single byte)
> codes, or a sequence of unicode (two byte - UTF-16) codes
> - name: a type which stores a string, but uniques the string so that
> caseless and exact equality checking is constant time
> - data: a type which stores a sequence of bytes
> - array: a type which stores (using a hashtable) a mapping from 'names' to
> any other storage value type
> The LCS part of the engine then sits on top of these core types, providing
> various conversions depending on context.
> All LCS syntax is actually typed - meaning that when you pass a value to any
> piece of LCS syntax, each argument is converted to the type required.
> e.g. charToNativeNum() has signature 'integer charToNativeNum(string)' meaning
> that it expects a string as input and will return a number as output.
> Some syntax is overloaded - meaning that it can act in slightly different (but
> always consistent) ways depending on the type of the arguments.
> e.g. & has signatures 'string &(string, string)' and 'data &(data, data)'.
> In simple cases where there is no overload, type conversion occurs exactly as
> e.g. In the case of charToNativeNum() - it has no overload, so always expects
> a string, which means that the input argument will always undergo a 'convert
> to string' operation.
> The convert to string operation operates as follows:
> - nothing -> ""
> - boolean -> "true" or "false"
> - number -> decimal representation of the number, using numberFormat
> - string -> stays the same
> - name -> uses the string the name contains
> - data -> converts to a string using the native encoding
> - array -> converts to empty (a very old semantic which probably does more
> harm than good!)
> In cases where syntax is overloaded, type conversion generally happens in
> syntax-specific sequence in order to preserve consistency:
> e.g. In the case of &, it can either take two data arguments, or two string
> arguments. In this case, if both arguments are data, then the result will be
> data. Otherwise both arguments will be converted to strings, and a string
> returned.
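
The '&' rule above can be sketched in LiveCode Script as follows. This is an
illustration only: the handler name checkConcat and the parameter pPath are
hypothetical, and it assumes that fetching a 'binfile:' URL yields a data
value in the same way as reading a file opened for binary. No results are
shown, since what 'is strictly' reports may differ between engine versions.

```livecode
command checkConcat pPath
   local tData, tBothData, tMixed

   -- Assumption: a binfile: URL gives back a data (binary string) value
   put URL ("binfile:" & pPath) into tData

   -- data & data: per the rule above, the result should still be data
   put ((tData & tData) is strictly a binary string) into tBothData

   -- data & string literal: per the rule above, both sides are converted
   -- to strings (the data via the native encoding) and a string results
   put ((tData & "abc") is strictly a string) into tMixed

   put "data & data is data:" && tBothData & return & \
         "data & string is string:" && tMixed
end checkConcat
```

Calling 'checkConcat' with the path of any small existing file writes both
checks to the message box.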
>> From Monte I get that the internal encoding for 'string' may be
>> MacRoman, ISO 8859 (I thought it would be CP1252), or UTF16 -
>> presumably with some attribute to tell the engine which one in each case.
> Monte wasn't quite correct - on Mac it is MacRoman or UTF-16, on Windows it
> is CP1252 or UTF-16, on Linux it is ISO-8859-1 or UTF-16. There is an internal
> flag in a string value which says whether its character sequence is
> single-byte (native) or double-byte (UTF-16).
>> So then my question is whether a 'binary string' is a pure blob, with
>> no clues as to interpretation; or whether in fact it does have some
>> attributes to suggest that it might be interpreted as UTF8, UTF32
> Data (binary string) values are pure blobs - they are sequences of bytes, with
> no knowledge of where they came from. Indeed, tracking that would generally be
> a bad idea as you wouldn't get repeatable semantics (i.e. a value from one
> codepath which is data might have a different effect in context from one
> which is fetched from somewhere else).
> That being said, the engine does store some flags on values - but purely for
> i.e. To save later work. For example, a string value can store its (double)
> numeric value in it - which saves multiple 'convert to number' operations
> performed on the same (pointer-wise) string (due to the copy-on-write nature
> of values, and the fact that all literals are unique names, pointer-wise
> equality of values occurs a great deal).
>> If there are no such attributes, how does codepointOffset operate when
>> passed a binary string?
> CodepointOffset has signature 'integer codepointOffset(string)', so when you
> pass a binary string (data) value to it, the data value gets converted to a
> string by interpreting it as a sequence of bytes in the native encoding.
>> If there are such attributes, how do they get set? Evidently if
>> textEncode is used, the engine knows that the resulting value is the
>> requested encoding. But what happens if the program reads a file as
>> 'binary' - presumably the result is a binary string, how does the
>> engine treat it?
> There are no attributes of that ilk. When you read a file as binary you get
> data (binary string) values - which means when you pass them to string-taking
> functions/commands that data gets interpreted as a sequence of bytes in the
> native encoding. This is why you must always explicitly textEncode/textDecode
> data values when you know they are not native encoded text.
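
A minimal sketch of that advice, assuming a file known to hold UTF-8 text.
The handler names readUtf8File and writeUtf8File and their parameters are
hypothetical; textDecode, textEncode and the 'binfile:' URL scheme are the
only engine features relied on.

```livecode
-- Read the raw bytes, then decode them explicitly instead of letting a
-- string-taking function interpret them in the native encoding
function readUtf8File pPath
   local tBytes
   put URL ("binfile:" & pPath) into tBytes
   return textDecode(tBytes, "UTF-8")
end readUtf8File

-- Encode explicitly before writing, so the bytes on disk are UTF-8
-- rather than the result of the default native-encoding conversion
command writeUtf8File pPath, pText
   put textEncode(pText, "UTF-8") into URL ("binfile:" & pPath)
end writeUtf8File
```

Keeping the decode and encode steps at the file boundary means everything in
between works on string values, so the implicit native-encoding conversion
never comes into play.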
>> Is there any way at LiveCode script level to detect what a value is,
>> in the above terms?
> Yes - the 'is strictly' operators:
> is strictly nothing
> is strictly a boolean
> is strictly an integer - a number which has internal rep 32-bit int
> is strictly a real - a number which has internal rep double
> is strictly a string
> is strictly a binary string
> is strictly an array
> It should be noted that 'is strictly' reports only how that value is stored
> and not anything based on the value itself. This only really applies to 'an
> integer' and 'a real' - you can store an integer in a double and all LCS
> arithmetic operators act on doubles.
> e.g. (1+2) is strictly an integer -> false
> (1+2) is strictly a real -> true
> In contrast, though, *some* syntax will return numbers which are stored
> internally as integers:
> e.g. nativeCharToNum("a") is strictly an integer -> true
> I should point out that what 'is strictly' operators return for any given
> context is not stable in the sense that future engine versions might return
> different things. e.g. We might optimize arithmetic in the future (if we can
> figure out a way to do it without performance penalty!) so that things which
> are definitely integers, are stored as integers (e.g. 1 + 2 in the above).
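
The operators listed above could be wrapped in a small helper, sketched here
under the assumption that a value keeps its internal representation when
passed as a parameter. The function name describeStorage is hypothetical, and
as noted above the answer for a given expression may change between engine
versions.

```livecode
function describeStorage pValue
   -- Each test reports only how the value is stored, not what it contains
   if pValue is strictly nothing then return "nothing"
   if pValue is strictly a boolean then return "boolean"
   if pValue is strictly an integer then return "integer"
   if pValue is strictly a real then return "real"
   if pValue is strictly a binary string then return "binary string"
   if pValue is strictly a string then return "string"
   if pValue is strictly an array then return "array"
   return "unknown"
end describeStorage
```

For example, 'put describeStorage(1 + 2)' can be compared against the
'(1+2) is strictly a real' case quoted above.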
>> And one more question: if a string, or binary string, is saved in a
>> 'binary' file, are the bytes stored on disk a faithful rendition of
>> the bytes that composed the value in memory, or an interpretation of
>> some kind?
> What happens when you read or write data or string values to a file depends on
> how you opened the file.
> If you opened the file for binary (whether reading or writing), when you read
> you will get data, when you write string values will be converted to data via
> the native encoding (default rule).
> If you opened the file for text, then the engine will try and determine (using
> a BOM) the existing text encoding of the file. If it can't determine it (if
> for example, you are opening a file for write which doesn't exist), it will
> assume it is encoded as native.
> Otherwise the file will have an explicit encoding associated with it specified
> by you - reading from it will interpret the bytes in that explicit encoding;
> while writing to it will expect string values which will be encoded
> appropriately. In the latter case if you write data values, they will first be
> converted to a string (assuming native encoding) and then written as strings
> in the file's encoding (i.e. default type conversion applies).
> Essentially you can view files as typed streams - if you opened for binary
> read/write give/take data; if you opened for text then read/write give/take
> strings and default type conversion rules apply.
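
The typed-stream view can be checked with a short sketch that reads the same
file both ways. The handler name compareReadModes and the parameter pPath are
hypothetical; results are left for the reader to observe rather than stated
here.

```livecode
command compareReadModes pPath
   local tAsData, tAsText

   -- Opened for binary: reads are expected to give data
   open file pPath for binary read
   read from file pPath until EOF
   put it into tAsData
   close file pPath

   -- Opened for text: reads are expected to give strings
   open file pPath for text read
   read from file pPath until EOF
   put it into tAsText
   close file pPath

   put "binary read is data:" && (tAsData is strictly a binary string) & return & \
         "text read is string:" && (tAsText is strictly a string)
end compareReadModes
```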
> Warmest Regards,