byteLen()?

Mark Waddingham mark at livecode.com
Thu Mar 9 13:35:48 EST 2017


On 2017-03-09 19:06, Richard Gaskin via use-livecode wrote:
> Thanks. I don't mind the verbosity, but I could use some clarity:
> 
> There's been talk of LC using UTF-16 internally, but when I do this:
> 
> on mouseUp
>    put "Hello" into s
>    put the number of bytes of s
> end mouseUp
> 
> ...I get "5".
> 
> When does LC use UTF-16, and when it's not UTF-16 is it still
> ISO-8859-1 or UTF-8?

Internally, strings are stored as either UTF-16 or the native encoding 
(MacRoman, Latin-1), depending on the content of the string (the 
engine transparently switches internal encoding as necessary). However, 
this is an internal implementation detail - it might do something 
completely different in the future...
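For example, rather than relying on what the engine happens to do 
internally, you can ask for a specific encoding explicitly - something 
like this (the counts assume "Hello" is plain ASCII):

on mouseUp
   put "Hello" into s
   put the number of bytes of textEncode(s, "UTF-16") -- 10: two bytes per UTF-16 code unit
   put the number of bytes of textEncode(s, "UTF-8")  -- 5: ASCII chars are one byte in UTF-8
   put the number of bytes of textEncode(s, "Native") -- 5: one native byte per char
end mouseUp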

Before 7, the notions of 'byte' and 'char' were synonymous - if you used a 
string in a context expecting text, it was interpreted as a string 
encoded in the native encoding; if you used a string in a context 
expecting binary data, it was interpreted as just plain bytes.

With the advent of 7 it is necessary to treat text and binary separately 
- they aren't the same thing at all, for the simple reason that text only 
becomes binary when you choose a specific text encoding and apply it to 
the Unicode string.
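To go between the two explicitly, you pick an encoding and apply it - a 
quick sketch:

   -- text -> binary requires choosing an encoding
   put textEncode("café", "UTF-8") into tData
   put the number of bytes of tData     -- 5: "é" takes two bytes in UTF-8
   -- binary -> text requires knowing which encoding was used
   put textDecode(tData, "UTF-8") into tText
   put the number of chars of tText     -- 4

(Decoding with the wrong encoding will, of course, give you the wrong 
text back.)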

In order to ensure that code written prior to 7 worked identically in 7 
it was necessary to add an automatic conversion between text and binary 
which preserved the previous behavior which (essentially) viewed text 
and binary strings as being the same thing.

Indeed, in the above code what is actually happening is this:

on mouseUp
   put "Hello" into a
   put the number of bytes of <implicit-text-to-data>(s)
end mouseUp
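Ignoring for a moment the char-by-char fallback described below, that 
implicit step is roughly what you would get by writing the conversion 
out yourself:

on mouseUp
   put "Hello" into s
   put the number of bytes of textEncode(s, "Native") -- 5, same as the implicit conversion
end mouseUp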

One important property which existed before 7 was that:

    the number of bytes in s == the number of chars in s

However, the definition of 'char' changed in 7 to mean a Unicode 
grapheme - something which will often require many bytes to encode in 
any encoding (e.g. [e, combining-acute] is a perfectly valid way to 
express e-acute in Unicode - taking two codepoints, and not one). In 
order to keep the above equivalence (which would break many things if it 
were not kept) the implicit-text-to-data conversion is defined as 
follows:

   local tData
   repeat for each char tChar in tString
      -- try to encode this char in the native encoding
      get textEncode(tChar, "native")
      if textDecode(it, "native") is tChar then
         -- it round-trips cleanly, so keep the single native byte
         put it after tData
      else
         -- not representable natively: substitute a single "?" byte
         put "?" after tData
      end if
   end repeat
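So, for example, a char which has no native representation collapses to 
a single '?' byte - roughly speaking (and assuming the conversion 
behaves as sketched above):

   put the number of bytes of "日"   -- 1: not native, so it maps to a single byte
   put byteToNum(byte 1 of "日")     -- 63, i.e. the code for "?"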

(Note: The engine does work quite hard to keep things as equivalent as 
possible - it normalizes tString to NFC first so that it doesn't matter 
if the string has passed through a process which has happened to 
decompose it, or if it has come from a source which favours decomposed 
representations - most notably Mac HFS filenames).
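So a decomposed sequence which normalizes to a native char still counts 
as a single byte - for example (769 being the codepoint of COMBINING 
ACUTE ACCENT, U+0301; this assumes the normalization step behaves as 
described):

   put "e" & numToCodepoint(769) into tString  -- decomposed e-acute
   put the number of chars of tString          -- 1: a single grapheme
   put the number of bytes of tString          -- 1: NFC composes it to "é", which is native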

This approach means that any multi-codepoint character in Unicode still 
maps to a single byte - and any non-updated code which manipulates 
strings as if they were data will still work (albeit with some data loss 
with regard to the original Unicode string - which such code wasn't 
written to understand anyway).
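For instance (again, assuming the conversion sketched above), the 
equivalence holds even when part of the string can't be represented 
natively:

   put "Héllo 世界" into s
   put the number of chars of s   -- 8
   put the number of bytes of s   -- 8: "世" and "界" each become a single "?" byte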

In the future, it is entirely possible that we will make it a runtime 
error to implicitly convert between data and string (don't worry, it 
wouldn't be the default behavior), because if you aren't explicit about 
which conversion you are using, it is a potential source of hard-to-find 
errors in code.
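The future-proof (and clearer) approach is to do the conversion 
explicitly whenever you actually mean bytes - something along these 
lines:

   -- be explicit about the encoding when you want bytes...
   put textEncode(tString, "UTF-8") into tData
   put the number of bytes of tData into tByteCount
   -- ...and when you turn bytes back into text
   put textDecode(tData, "UTF-8") into tText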

Warmest Regards,

Mark.

-- 
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps



