What is LC's internal text format?
monte at appisle.net
Tue Nov 13 18:44:38 EST 2018
> On 14 Nov 2018, at 6:33 am, Ben Rubinstein via use-livecode <use-livecode at lists.runrev.com> wrote:
> That's really helpful - and in parts eye-opening - thanks Mark.
> I have a few follow-up questions.
> Does textEncode _always_ return a binary string? Or, if invoked with "CP1252", "ISO-8859-1", "MacRoman" or "Native", does it return a string?
Internally we have different types of values. So we have MCStringRef which is the thing which either contains a buffer of native chars or a buffer of UTF-16 chars. There are others. For example, MCNumberRef will either hold a 32 bit signed int or a double. These are returned by numeric operations where there’s no string representation of a number. So:
put 1.0 into tNumber # tNumber holds an MCStringRef
put 1.0 + 0 int0 tNumber # tNumber holds an MCNumberRef
The return type of textEncode is an MCDataRef. This is a byte buffer, buffer size & byte count.
put textEncode(“foo”, “UTF-8”) into tFoo # tFoo holds MCDataRef
Then if we do something like:
set the text of field “foo” to tFoo
tFoo is first converted to MCStringRef. As it’s an MCDataRef we just move the buffer over and say it’s a native encoded string. There’s no checking to see if it’s a UTF-8 string and decoding with that etc.
Then the string is put into the field.
If you remember that mergJSON issue you reported where mergJSON returns UTF-8 data and you were putting it into a field and it looked funny this is why.
> > CodepointOffset has signature 'integer codepointOffset(string)', so when you
> > pass a binary string (data) value to it, the data value gets converted to a
> > string by interpreting it as a sequence of bytes in the native encoding.
> OK - so one message I take are that in fact one should never invoke codepointOffset on a binary string. Should it actually throw an error in this case?
No, as mentioned above values can move to and from different types according to the operations performed on them and this is largely opaque to the scripter. If you do a text operation on a binary string then there’s an implicit conversion to a native encoded string. You generally want to use codepoint in 7+ generally where previously you used char unless you know you are dealing with a binary string and then you use byte.
> By the same token, probably one should only use 'byte', 'byteOffset', 'byteToNum' etc with binary strings - would it be better, to avoid confusion, if char, offset, charToNum should refuse to operate on a binary string?
That would not be backwards compatible.
>> e.g. In the case of &, it can either take two data arguments, or two
>> string arguments. In this case, if both arguments are data, then the result
>> will be data. Otherwise both arguments will be converted to strings, and a
>> string returned.
> The second message I take is that one needs to be very careful, if operating on UTF8 or other binary strings, to avoid 'contaminating' them e.g. by concatenating with a simple quoted string, as this may cause it to be silently converted to a non-binary string. (I presume that 'put "simple string" after/before pBinaryString' will cause a conversion in the same way as "&"? What about 'put "!" into char x of pBinaryString?)
When concatenating if both left and right are binary strings (MCDataRef) then there’s no conversion of either to string however we do not currently have a way to declare a literal as a binary string (might be nice if we did!) so you would need to:
put textEncode("simple string”, “UTF-8”) after pBinaryString
> The engine can tell whether a string is 'native' or UTF16. When the engine is converting a binary string to 'string', does it always interpret the source as the native 8-bit encoding, or does it have some heuristic to decide whether it would be more plausible to interpret the source as UTF16?
No it does not try to interpret. ICU has a charset detector that will give you a list of possible charsets along with a confidence. It could be implemented as a separate api:
get detectedTextEncodings(<binary string>, [<optional hint charset>]) -> array of charset/confidence pairs
get bestDetectedTextEncoding(<binary string>, [<optional hint charset>]) -> charset
Feel free to feature request that!
More information about the Use-livecode