What is LC's internal text format?

Ben Rubinstein benr_mc at cogapp.com
Tue Nov 20 11:33:19 EST 2018


Hi Monte,

Thanks for this, and sorry for the delayed reply - I've been away.

 >> Does textEncode _always_ return a binary string? Or, if invoked with
 >> "CP1252", "ISO-8859-1", "MacRoman" or "Native", does it return a string?
 >
 > Internally we have different types of values. So we have MCStringRef which
 > is the thing which either contains a buffer of native chars or a buffer of
 > UTF-16 chars. There are others.
...
 > The return type of textEncode is an MCDataRef. This is a byte buffer,
 > buffer size & byte count.
 >
 > So:
 > put textEncode("foo", "UTF-8") into tFoo # tFoo holds MCDataRef
 >
 > Then if we do something like:
 > set the text of field "foo" to tFoo
 >
 > tFoo is first converted to MCStringRef. As it’s an MCDataRef we just move
 > the buffer over and say it’s a native encoded string. There’s no checking
 > to see if it’s a UTF-8 string and decoding with that etc.

So my question would be: is this helpful? If, given any MCDataRef (i.e. 
'binary string'), LC assumes - whenever it needs an MCStringRef - that the 
binary string is 'native', then I would think it will be wrong more often 
than it is correct!
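
To make this concrete, here's a minimal sketch (the field name is just for 
illustration, and the exact mojibake depends on which native charset the 
platform uses):

put textEncode("é", "UTF-8") into tData  # two bytes: 0xC3 0xA9
set the text of field "out" to tData     # engine assumes the bytes are native
# on a CP1252 system the field shows "Ã©", not "é"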

IIUC, the chief ways to obtain an MCDataRef are by reading a file in binary 
mode, or by calling textEncode (or loading a non-file URL???). Insofar as one 
could make an assumption at all, my guess is that in the first case the data 
is more likely to be UTF-8; and whatever is most likely in the second case, 
'native' is about the least likely. (If the assumption were UTF-16 it would 
at least make more sense.)

Would it not be better to refuse to make an assumption, i.e. to require an 
explicit conversion? If you want to proceed on the assumption that a file is 
'native' text, read it as text; if you know what it is, read it as binary and 
use textDecode. If you used textEncode anyway (or numToByte) then obviously 
you know what it is, and when you want to make a string out of it you can 
tell LC how to interpret it. Wouldn't it be better to throw an error when an 
MCDataRef is passed where an MCStringRef is required, than to introduce 
subtle errors by just making (in my opinion implausible) assumptions?
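
In other words, I'd rather be required to write something like this (a 
sketch; tPath is just a placeholder):

put URL ("binfile:" & tPath) into tData    # binary read: an MCDataRef
put textDecode(tData, "UTF-8") into tText  # I state how to interpret the bytes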

And now that the thought has occurred to me - when a URL with a non-file 
protocol is used as the source of a value, what is the type of the value - 
MCStringRef or MCDataRef?

thanks for the continuing education!

Ben

On 13/11/2018 23:44, Monte Goulding via use-livecode wrote:
> 
> 
>> On 14 Nov 2018, at 6:33 am, Ben Rubinstein via use-livecode <use-livecode at lists.runrev.com> wrote:
>>
>> That's really helpful - and in parts eye-opening - thanks Mark.
>>
>> I have a few follow-up questions.
>>
>> Does textEncode _always_ return a binary string? Or, if invoked with "CP1252", "ISO-8859-1", "MacRoman" or "Native", does it return a string?
> 
> Internally we have different types of values. So we have MCStringRef which is the thing which either contains a buffer of native chars or a buffer of UTF-16 chars. There are others. For example, MCNumberRef will either hold a 32 bit signed int or a double. These are returned by numeric operations where there’s no string representation of a number. So:
> 
> put 1.0 into tNumber # tNumber holds an MCStringRef
> put 1.0 + 0 into tNumber # tNumber holds an MCNumberRef
> 
> The return type of textEncode is an MCDataRef. This is a byte buffer, buffer size & byte count.
> 
> So:
> put textEncode("foo", "UTF-8") into tFoo # tFoo holds MCDataRef
> 
> Then if we do something like:
> set the text of field "foo" to tFoo
> 
> tFoo is first converted to MCStringRef. As it’s an MCDataRef we just move the buffer over and say it’s a native encoded string. There’s no checking to see if it’s a UTF-8 string and decoding with that etc.
> 
> Then the string is put into the field.
> 
> If you remember that mergJSON issue you reported - where mergJSON returns 
> UTF-8 data, and putting it into a field made it look funny - this is why.
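> 
> e.g. the explicit fix is something like this (tJSON standing in for the 
> mergJSON result):
> 
> put textDecode(tJSON, "UTF-8") into field "foo" # decode before assignment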
>>
>>> CodepointOffset has signature 'integer codepointOffset(string)', so when you
>>> pass a binary string (data) value to it, the data value gets converted to a
>>> string by interpreting it as a sequence of bytes in the native encoding.
>>
>> OK - so one message I take is that in fact one should never invoke codepointOffset on a binary string. Should it actually throw an error in this case?
> 
> No, as mentioned above values can move to and from different types according to the operations performed on them, and this is largely opaque to the scripter. If you do a text operation on a binary string then there’s an implicit conversion to a native encoded string. You generally want to use codepoint in 7+ where previously you used char, unless you know you are dealing with a binary string, in which case you use byte.
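> 
> e.g. a minimal sketch (one accented char is one codepoint but two UTF-8 bytes):
> 
> put textEncode("é", "UTF-8") into tData
> put the number of bytes in tData       # 2
> put textDecode(tData, "UTF-8") into tText
> put the number of codepoints in tText  # 1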
>>
>> By the same token, probably one should only use 'byte', 'byteOffset', 'byteToNum' etc with binary strings - would it be better, to avoid confusion, if char, offset, charToNum should refuse to operate on a binary string?
> 
> That would not be backwards compatible.
>>
>>> e.g. In the case of &, it can either take two data arguments, or two
>>> string arguments. In this case, if both arguments are data, then the result
>>> will be data. Otherwise both arguments will be converted to strings, and a
>>> string returned.
>> The second message I take is that one needs to be very careful, if operating on UTF-8 or other binary strings, to avoid 'contaminating' them, e.g. by concatenating with a simple quoted string, as this may cause them to be silently converted to a non-binary string. (I presume that 'put "simple string" after/before pBinaryString' will cause a conversion in the same way as "&"? What about 'put "!" into char x of pBinaryString'?)
> 
> When concatenating, if both left and right are binary strings (MCDataRef) then there’s no conversion of either to string. However, we do not currently have a way to declare a literal as a binary string (might be nice if we did!), so you would need to:
> 
> put textEncode("simple string”, “UTF-8”) after pBinaryString
> 
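> e.g. a quick sketch: "&" stays in the data domain only when both operands 
> are data:
> 
> put textEncode("a", "UTF-8") into tLeft
> put tLeft & textEncode("b", "UTF-8") into tBoth # both data: result is data
> put tLeft & "b" into tMixed # the literal is a string, so both sides convert
> 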
>>
>> The engine can tell whether a string is 'native' or UTF-16. When the engine is converting a binary string to 'string', does it always interpret the source as the native 8-bit encoding, or does it have some heuristic to decide whether it would be more plausible to interpret the source as UTF-16?
> 
> No, it does not try to interpret. ICU has a charset detector that will give you a list of possible charsets along with a confidence. It could be implemented as a separate API:
> 
> get detectedTextEncodings(<binary string>, [<optional hint charset>]) -> array of charset/confidence pairs
> 
> get bestDetectedTextEncoding(<binary string>, [<optional hint charset>]) -> charset
> 
> Feel free to feature request that!
> 
> Cheers
> 
> Monte
> 



