What is LC's internal text format?
bobsneidar at iotecdigital.com
Tue Nov 20 12:11:47 EST 2018
I'm not grasping the import of the question here, but it seems to be about what happens "under the hood", in relation to the format of the data as it is exposed to any I/O. In that context it strikes me as academic. If there is a problem with what's going on "under the hood", that of course needs to be addressed. But if it's not affecting what the developer/user "sees" in terms of the format of the data, I don't see the point.
> On Nov 20, 2018, at 08:33 , Ben Rubinstein via use-livecode <use-livecode at lists.runrev.com> wrote:
> Hi Monte,
> Thanks for this, sorry for delayed reply - I've been away.
> >> Does textEncode _always_ return a binary string? Or, if invoked with "CP1252", "ISO-8859-1", "MacRoman" or "Native", does it return a string?
> > Internally we have different types of values. So we have MCStringRef which is the thing which either contains a buffer of native chars or a buffer of UTF-16 chars. There are others.
> > The return type of textEncode is an MCDataRef. This is a byte buffer, buffer size & byte count.
> > So:
> > put textEncode("foo", "UTF-8") into tFoo # tFoo holds MCDataRef
> > Then if we do something like:
> > set the text of field "foo" to tFoo
> > tFoo is first converted to MCStringRef. As it’s an MCDataRef we just move the buffer over and say it’s a native encoded string. There’s no checking to see if it’s a UTF-8 string and decoding with that etc.
> So my question would be, is this helpful? If, given any MCDataRef (i.e. 'binary string'), LC makes the assumption - when it needs an MCStringRef - that the binary string is 'native', then I would think it will be wrong more often than it is correct!
> IIUC, the chief ways to obtain an MCDataRef are by reading a file in binary mode, or by calling textEncode (or loading a non-file URL???). Insofar as one could make an assumption at all, my guess is that in the first case the data is more likely to be UTF8; and whatever is most likely in the second case, 'native' is about the least likely. (If the assumption was UTF16 it would at least make more sense.)
> Would it not be better to refuse to make an assumption, i.e. require an explicit conversion? If you want to proceed on the assumption that a file is 'native' text, read it as text; if you know what it is, read it as binary and use textDecode. If you used textEncode anyway (or numToByte) then obviously you know what it is, and when you want to make a string out of it you can tell LC how to interpret it. Wouldn't it be better to throw an error when an MCDataRef is passed where an MCStringRef is required, than to introduce subtle errors by just making (in my opinion implausible) assumptions?
> And now that the thought has occurred to me - when a URL with a non-file protocol is used as the source of a value, what is the type of the value - MCStringRef or MCDataRef?
> thanks for the continuing education!
> On 13/11/2018 23:44, Monte Goulding via use-livecode wrote:
>>> On 14 Nov 2018, at 6:33 am, Ben Rubinstein via use-livecode <use-livecode at lists.runrev.com> wrote:
>>> That's really helpful - and in parts eye-opening - thanks Mark.
>>> I have a few follow-up questions.
>>> Does textEncode _always_ return a binary string? Or, if invoked with "CP1252", "ISO-8859-1", "MacRoman" or "Native", does it return a string?
>> Internally we have different types of values. So we have MCStringRef which is the thing which either contains a buffer of native chars or a buffer of UTF-16 chars. There are others. For example, MCNumberRef will either hold a 32 bit signed int or a double. These are returned by numeric operations where there’s no string representation of a number. So:
>> put 1.0 into tNumber # tNumber holds an MCStringRef
>> put 1.0 + 0 into tNumber # tNumber holds an MCNumberRef
>> The return type of textEncode is an MCDataRef. This is a byte buffer, buffer size & byte count.
>> put textEncode("foo", "UTF-8") into tFoo # tFoo holds MCDataRef
>> Then if we do something like:
>> set the text of field "foo" to tFoo
>> tFoo is first converted to MCStringRef. As it’s an MCDataRef we just move the buffer over and say it’s a native encoded string. There’s no checking to see if it’s a UTF-8 string and decoding with that etc.
>> Then the string is put into the field.
>> If you remember that mergJSON issue you reported where mergJSON returns UTF-8 data and you were putting it into a field and it looked funny this is why.
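The symptom described above, and the explicit conversion that avoids it, can be sketched roughly as follows. This is only an illustration: it assumes a field named "foo" exists on the current card, and the handler and variable names (mouseUp, tData, tText) are made up for the example. numToCodepoint(233) is used to build "café" without relying on a non-ASCII literal in the script.

```livecode
on mouseUp
   local tData, tText

   # tData is binary data (an MCDataRef internally, per the explanation above)
   put textEncode("caf" & numToCodepoint(233), "UTF-8") into tData

   # Implicit conversion: the bytes are taken to be native-encoded text,
   # so the two-byte UTF-8 sequence for "é" shows up as two characters.
   set the text of field "foo" to tData

   # Explicit conversion: decode the bytes as UTF-8 first, then display.
   put textDecode(tData, "UTF-8") into tText
   set the text of field "foo" to tText
end mouseUp
```

The point is simply that the decode step has to be stated by the script; nothing in the value itself records that it was UTF-8.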
>>>> CodepointOffset has signature 'integer codepointOffset(string)', so when you
>>>> pass a binary string (data) value to it, the data value gets converted to a
>>>> string by interpreting it as a sequence of bytes in the native encoding.
>>> OK - so one message I take is that in fact one should never invoke codepointOffset on a binary string. Should it actually throw an error in this case?
>> No, as mentioned above values can move to and from different types according to the operations performed on them, and this is largely opaque to the scripter. If you do a text operation on a binary string then there’s an implicit conversion to a native encoded string. In 7+ you generally want to use codepoint where previously you used char, unless you know you are dealing with a binary string, in which case you use byte.
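As a rough illustration of the codepoint/byte distinction drawn above, the following sketch counts the same content both ways. The handler and variable names are hypothetical; the counts in the comments follow from "é" (U+00E9) being one codepoint that takes two bytes in UTF-8.

```livecode
on mouseUp
   local tText, tData

   put "caf" & numToCodepoint(233) into tText   # text string: "café"
   put textEncode(tText, "UTF-8") into tData    # binary data: the UTF-8 bytes

   # Text chunks for text, byte chunks for binary data.
   answer "codepoints:" && the number of codepoints in tText & return & \
         "bytes:" && the number of bytes in tData
   # codepoints: 4, bytes: 5
end mouseUp
```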
>>> By the same token, probably one should only use 'byte', 'byteOffset', 'byteToNum' etc with binary strings - would it be better, to avoid confusion, if char, offset, charToNum should refuse to operate on a binary string?
>> That would not be backwards compatible.
>>>> e.g. In the case of &, it can either take two data arguments, or two
>>>> string arguments. In this case, if both arguments are data, then the result
>>>> will be data. Otherwise both arguments will be converted to strings, and a
>>>> string returned.
>>> The second message I take is that one needs to be very careful, if operating on UTF8 or other binary strings, to avoid 'contaminating' them e.g. by concatenating with a simple quoted string, as this may cause it to be silently converted to a non-binary string. (I presume that 'put "simple string" after/before pBinaryString' will cause a conversion in the same way as "&"? What about 'put "!" into char x of pBinaryString'?)
>> When concatenating, if both left and right are binary strings (MCDataRef) then there’s no conversion of either to string. However, we do not currently have a way to declare a literal as a binary string (might be nice if we did!) so you would need to:
>> put textEncode("simple string", "UTF-8") after pBinaryString
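Put together, the pattern above amounts to keeping everything binary until text is actually needed, and decoding once at that point. A minimal sketch, with a hypothetical handler and variable name (tBinary):

```livecode
on mouseUp
   local tBinary

   put textEncode("first part", "UTF-8") into tBinary

   # Both sides are binary data, so per the explanation above
   # neither is converted to a string by the concatenation.
   put textEncode(" simple string", "UTF-8") after tBinary

   # Decode once, where a text string is required.
   answer textDecode(tBinary, "UTF-8")
end mouseUp
```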
>>> The engine can tell whether a string is 'native' or UTF16. When the engine is converting a binary string to 'string', does it always interpret the source as the native 8-bit encoding, or does it have some heuristic to decide whether it would be more plausible to interpret the source as UTF16?
>> No, it does not try to interpret. ICU has a charset detector that will give you a list of possible charsets along with a confidence. It could be implemented as a separate API:
>> get detectedTextEncodings(<binary string>, [<optional hint charset>]) -> array of charset/confidence pairs
>> get bestDetectedTextEncoding(<binary string>, [<optional hint charset>]) -> charset
>> Feel free to feature request that!
>> use-livecode mailing list
>> use-livecode at lists.runrev.com
>> Please visit this url to subscribe, unsubscribe and manage your subscription preferences: