byteLen()?
Mark Waddingham
mark at livecode.com
Thu Mar 9 13:35:48 EST 2017
On 2017-03-09 19:06, Richard Gaskin via use-livecode wrote:
> Thanks. I don't mind the verbosity, but I could use some clarity:
>
> There's been talk of LC using UTF-16 internally, but when I do this:
>
> on mouseUp
> put "Hello" into s
> put the number of bytes of s
> end mouseUp
>
> ...I get "5".
>
> When does LC use UTF-16, and when it's not UTF-16 is it still
> ISO-8859-1 or UTF-8?
Internally strings are stored as either UTF-16 or in the native encoding
(MacRoman, Latin-1) depending on the content of the string and such (the
engine transparently switches internal encoding as necessary). However,
this is an internal implementation detail - it might do something
completely different in the future...
Before 7, 'byte' and 'char' were synonymous - if you used a string in a
context expecting text, it was interpreted as a string encoded in the
native encoding; if you used a string in a context expecting binary
data, it was interpreted as just plain bytes.
With the advent of 7 it is necessary to treat text and binary separately
- they aren't the same thing at all for the simple reason that text only
becomes binary when you choose a specific text encoding and apply it to
the Unicode string.
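To make that point concrete, here is a minimal sketch (not engine code) which encodes the same text explicitly with three encodings and collects the resulting byte counts. The variable names are just illustrative, and the exact counts depend on the encoding chosen, which is the point.

```livecode
on mouseUp
   local tText, tData, tReport
   put "Hello" into tText

   -- Same text, three explicit encodings: each yields its own byte sequence.
   put textEncode(tText, "UTF-8") into tData
   put "UTF-8:" && the number of bytes of tData into tReport

   put textEncode(tText, "UTF-16") into tData
   put return & "UTF-16:" && the number of bytes of tData after tReport

   put textEncode(tText, "native") into tData
   put return & "native:" && the number of bytes of tData after tReport

   put tReport
end mouseUp
```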
In order to ensure that code written prior to 7 worked identically in 7
it was necessary to add an automatic conversion between text and binary
which preserved the previous behavior which (essentially) viewed text
and binary strings as being the same thing.
Indeed, in the above code what is actually happening is this:
   on mouseUp
      put "Hello" into s
      put the number of bytes of <implicit-text-to-data>(s)
   end mouseUp
One important property which existed before 7 was that:
the number of bytes in s == the number of chars in s
However, the definition of 'char' changed in 7 to mean a Unicode
grapheme - something which will often require many bytes to encode in
any encoding (e.g. [e, combining-acute] is a perfectly valid way to
express e-acute in Unicode - taking two codepoints, and not one). In
order to keep the above equivalence (which would break many things if it
were not kept) the implicit-text-to-data conversion is defined as
follows:
   repeat for each char tChar in tString
      get textEncode(tChar, "native")
      if textDecode(it, "native") is tChar then
         put it after tData
      else
         put "?" after tData
      end if
   end repeat
(Note: The engine does work quite hard to keep things as equivalent as
possible - it normalizes tString to NFC first so that it doesn't matter
if the string has passed through a process which has happened to
decompose it, or if it has come from a source which favours decomposed
representations - most notably Mac HFS filenames).
This approach means that any multi-codepoint character in Unicode still
maps to a single byte - and any non-updated code which manipulates
strings as if they are data will still work (albeit with some data loss
with regard to the original Unicode string - which it wasn't written to
understand anyway).
In the future, it is entirely possible that we will make it a runtime
error to implicitly convert between data and string (don't worry, it
wouldn't be the default behavior) because if you aren't clear about how
you are doing the conversion (i.e. which conversion you are using) it is
a potential source of hard-to-find errors in code.
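For code that wants to avoid relying on the implicit conversion, a minimal sketch of being explicit at a file boundary might look like this. The file name and the choice of UTF-8 are assumptions for illustration only.

```livecode
on mouseUp
   local tPath, tText, tData, tRoundTrip
   -- Hypothetical file location, for illustration only.
   put specialFolderPath("documents") & "/example.txt" into tPath
   put "Hello" into tText

   -- Text -> data: choose the encoding explicitly when writing.
   put textEncode(tText, "UTF-8") into tData
   put tData into URL ("binfile:" & tPath)

   -- Data -> text: decode with the same encoding when reading back.
   put URL ("binfile:" & tPath) into tData
   put textDecode(tData, "UTF-8") into tRoundTrip

   put (tRoundTrip is tText)
end mouseUp
```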
Warmest Regards,
Mark.
--
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps