What is LC's internal text format?

Mark Waddingham mark at livecode.com
Tue Nov 13 08:31:35 EST 2018


On 2018-11-13 12:43, Ben Rubinstein via use-livecode wrote:
> I'm grateful for all the information, but _outraged_ that the thread
> that I carefully created separate from the offset thread was so
> quickly hijacked for the continuing (useful!) detailed discussion on
> that topic.

The phrase 'attempting to herd cats' springs to mind ;)

> From recent contributions on both threads I'm getting some more
> insights, but I'd really like to understand clearly what's going on. I
> do think that I should have asked this question more broadly: how does
> the engine represent values internally?

The engine uses a number of distinct types 'behind the scenes'. The ones
pertinent to LCS (there are many many more which LCS never sees) are:

   - nothing: a type with a single value (nothing/null)
   - boolean: a type with two values true/false
   - number: a type which can either store a 32-bit integer *or* a double
   - string: a type which can either store a sequence of native (single 
byte) codes, or a sequence of unicode (two byte - UTF-16) codes
   - name: a type which stores a string, but uniques the string so that 
caseless and exact equality checking is constant time
   - data: a type which stores a sequence of bytes
   - array: a type which stores (using a hashtable) a mapping from 
'names' to any other storage value type

The LCS part of the engine then sits on top of these core types,
providing various conversions depending on context.

All LCS syntax is actually typed - meaning that when you pass a value to
any piece of LCS syntax, each argument is converted to the type required.

e.g. nativeCharToNum() has signature 'integer nativeCharToNum(string)',
meaning that it expects a string as input and will return a number as
output.

Some syntax is overloaded - meaning that it can act in slightly 
different (but always consistent) ways depending on the type of the 
arguments.

e.g. & has signatures 'string &(string, string)' and 'data &(data, 
data)'.

In simple cases where there is no overload, type conversion occurs 
exactly as required:

e.g. nativeCharToNum() has no overload, so it always expects a string,
which means that the input argument will always undergo a 'convert to
string' operation.

The convert to string operation operates as follows:

    - nothing -> ""
    - boolean -> "true" or "false"
    - number -> decimal representation of the number, using numberFormat
    - string -> stays the same
    - name -> uses the string the name contains
    - data -> converts to a string using the native encoding
    - array -> converts to empty (a very old semantic which probably does 
more harm than good!)
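
As a rough illustration only (this is a toy model in Python, not engine code), the 'convert to string' rules above can be sketched as follows. The name convert_to_string, the NATIVE_ENCODING constant and the number_format parameter are all hypothetical; in particular, the "{:.6g}" format is just a stand-in for numberFormat, and a name is modelled as an ordinary string.

```python
from typing import Any

# Assumption: the native encoding is CP1252 (Windows). On Mac it would be
# MacRoman ("mac_roman") and on Linux ISO 8859-1 ("latin-1").
NATIVE_ENCODING = "cp1252"


def convert_to_string(value: Any, number_format: str = "{:.6g}") -> str:
    """Toy model of the 'convert to string' rules; not the engine's code."""
    if value is None:                          # nothing -> ""
        return ""
    if isinstance(value, bool):                # boolean -> "true" or "false"
        return "true" if value else "false"    # (checked before int: bool is an int)
    if isinstance(value, (int, float)):        # number -> decimal text
        return number_format.format(value)     # stand-in for numberFormat
    if isinstance(value, str):                 # string (or name) -> unchanged
        return value
    if isinstance(value, (bytes, bytearray)):  # data -> decode as native
        return bytes(value).decode(NATIVE_ENCODING, errors="replace")
    if isinstance(value, dict):                # array -> empty
        return ""
    raise TypeError(f"unsupported value type: {type(value).__name__}")
```

For instance, convert_to_string(b"abc") gives "abc" and convert_to_string({"a": 1}) gives the empty string, mirroring the data and array rules.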

In cases where syntax is overloaded, type conversion generally happens
in a syntax-specific sequence in order to preserve consistency:

e.g. In the case of &, it can either take two data arguments, or two 
string arguments. In this case,
if both arguments are data, then the result will be data. Otherwise both 
arguments will be converted
to strings, and a string returned.
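
A minimal sketch of that overload rule, again as a Python toy model rather than the engine's implementation: bytes stands in for data, str for string, and the names concat, _to_string and NATIVE_ENCODING are hypothetical.

```python
NATIVE_ENCODING = "cp1252"  # assumption: Windows native encoding


def _to_string(value) -> str:
    """Simplified 'convert to string': data is decoded as native text."""
    if isinstance(value, bytes):
        return value.decode(NATIVE_ENCODING, errors="replace")
    return str(value)


def concat(left, right):
    """Model of '&': data & data stays data, anything else becomes string."""
    if isinstance(left, bytes) and isinstance(right, bytes):
        return left + right                      # 'data &(data, data)'
    return _to_string(left) + _to_string(right)  # 'string &(string, string)'
```

So concat(b"\x01", b"\x02") stays a byte sequence, while concat("a", b"b") yields the string "ab" because one side is not data.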

> From Monte I get that the internal encoding for 'string' may be
> MacRoman, ISO 8859 (I thought it would be CP1252), or UTF16 -
> presumably with some attribute to tell the engine which one in each
> case.

Monte wasn't quite correct - on Mac it is MacRoman or UTF-16, on Windows
it is CP1252 or UTF-16, on Linux it is ISO8859-1 or UTF-16. There is an
internal flag in a string value which says whether its character
sequence is single-byte (native) or double-byte (UTF-16).

> So then my question is whether a 'binary string' is a pure blob, with
> no clues as to interpretation; or whether in fact it does have some
> attributes to suggest that it might be interpreted as UTF8, UTF32
> etc?

Data (binary string) values are pure blobs - they are sequences of
bytes - and a data value has no knowledge of where it came from. Indeed,
that would generally be a bad idea as you wouldn't get repeatable
semantics (i.e. a value from one codepath which is data might have a
different effect in context from one which is fetched from somewhere
else).

That being said, the engine does store some flags on values - but purely
for optimization, i.e. to save later work. For example, a string value
can store its (double) numeric value in it - which saves multiple
'convert to number' operations performed on the same (pointer-wise)
string. Due to the copy-on-write nature of values, and the fact that all
literals are unique names, pointer-wise equality of values occurs a
great deal.

> If there are no such attributes, how does codepointOffset operate when
> passed a binary string?

codepointOffset has signature 'integer codepointOffset(string)', so when
you pass a binary string (data) value to it, the data value gets
converted to a string by interpreting it as a sequence of bytes in the
native encoding.

> If there are such attributes, how do they get set? Evidently if
> textEncode is used, the engine knows that the resulting value is the
> requested encoding. But what happens if the program reads a file as
> 'binary' - presumably the result is a binary string, how does the
> engine treat it?

There are no attributes of that ilk. When you read a file as binary you
get data (binary string) values - which means when you pass them to
string-taking functions/commands that data gets interpreted as a
sequence of bytes in the native encoding. This is why you must always
explicitly textEncode/textDecode data values when you know they do not
represent native encoded text.
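
To make the consequence concrete, here is a small Python illustration (not LCS) of what happens when UTF-8 bytes are interpreted as native text. It assumes the native encoding is CP1252; the variable names are purely illustrative.

```python
# "é" (U+00E9) encoded as UTF-8 is the two bytes C3 A9.
utf8_bytes = "\u00e9".encode("utf-8")

# Implicit conversion: each byte is treated as one native (CP1252) char,
# which yields two wrong characters, U+00C3 and U+00A9.
as_native = utf8_bytes.decode("cp1252")

# Explicit decoding with the correct encoding (the role textDecode plays
# in LCS) recovers the single intended character.
as_utf8 = utf8_bytes.decode("utf-8")

print(repr(as_native), len(as_native))
print(repr(as_utf8), len(as_utf8))
```

The implicit path produces a two-character string, whereas the explicit decode produces the one character that was actually stored.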

> Is there any way at LiveCode script level to detect what a value is,
> in the above terms?

Yes - the 'is strictly' operators:

   is strictly nothing
   is strictly a boolean
   is strictly an integer - a number which has internal rep 32-bit int
   is strictly a real - a number which has internal rep double
   is strictly a string
   is strictly a binary string
   is strictly an array

It should be noted that 'is strictly' reports only how that value is 
stored and not anything based on the value itself. This only really 
applies to 'an integer' and 'a real' - you can store an integer in a 
double and all LCS arithmetic operators act on doubles.

e.g. (1+2) is strictly an integer -> false
      (1+2) is strictly a real -> true

In contrast, though, *some* syntax will return numbers which are stored 
internally as integers:

e.g. nativeCharToNum("a") is strictly an integer -> true

I should point out that what 'is strictly' operators return for any 
given context is not stable in the sense that future engine versions 
might return different things. e.g. We might optimize arithmetic in the 
future (if we can figure out a way to do it without performance 
penalty!) so that things which are definitely integers, are stored as 
integers (e.g. 1 + 2 in the above).
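
The storage-versus-value distinction can be modelled with a small Python sketch. This is only a toy under the behaviour described above (arithmetic always produces a double); the Number class and its method names are hypothetical, not engine internals.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Number:
    """Toy 'number' value: stored either as an int or as a double."""
    value: Union[int, float]

    def is_strictly_integer(self) -> bool:
        # Reports the storage type only, not whether the value is whole.
        return isinstance(self.value, int)

    def is_strictly_real(self) -> bool:
        return isinstance(self.value, float)

    def __add__(self, other: "Number") -> "Number":
        # Arithmetic operators act on doubles, so the result is a double
        # even when both operands were stored as integers.
        return Number(float(self.value) + float(other.value))


total = Number(1) + Number(2)
```

Here total holds the whole value 3.0, yet total.is_strictly_integer() is False and total.is_strictly_real() is True, while Number(97) on its own reports as strictly an integer.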

> And one more question: if a string, or binary string, is saved in a
> 'binary' file, are the bytes stored on disk a faithful rendition of
> the bytes that composed the value in memory, or an interpretation of
> some kind?

What happens when you read or write data or string values to a file 
depends on how you opened the file.

If you opened the file for binary (whether reading or writing), when you
read you will get data; when you write, string values will be converted
to data via the native encoding (default rule).

If you opened the file for text, then the engine will try and determine 
(using a BOM) the existing text encoding of the file. If it can't 
determine it (if for example, you are opening a file for write which 
doesn't exist), it will assume it is encoded as native.

Otherwise the file will have an explicit encoding associated with it 
specified by you - reading from it will interpret the bytes in that 
explicit encoding; while writing to it will expect string values which 
will be encoded appropriately. In the latter case if you write data 
values, they will first be converted to a string (assuming native 
encoding) and then written as strings in the file's encoding (i.e. 
default type conversion applies).

Essentially you can view a file as a typed stream - if you opened for
binary then read/write give/take data; if you opened for text then
read/write give/take strings and default type conversion rules apply.
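
The BOM-then-native fallback for text files could look roughly like the Python sketch below. It is not the engine's code: the set of BOMs checked, the decode_text_file name and the CP1252 native encoding are all assumptions made for illustration.

```python
import codecs

NATIVE_ENCODING = "cp1252"  # assumption: Windows native encoding

# Assumed BOM table. UTF-32 is listed first because the UTF-32 LE BOM
# begins with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_text_file(raw: bytes, native: str = NATIVE_ENCODING) -> str:
    """Decode file bytes using a BOM if present, else the native encoding."""
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            return raw[len(bom):].decode(encoding)
    # No recognisable BOM: assume the file is native encoded.
    return raw.decode(native, errors="replace")
```

With this model, bytes that start with a UTF-8 BOM are decoded as UTF-8, while the BOM-less bytes b"caf\xe9" fall back to CP1252.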

Warmest Regards,

Mark.

-- 
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps



