First 1000 characters without loop?
Mark Waddingham
mark at livecode.com
Fri Jun 23 05:09:59 EDT 2017
On 2017-06-23 03:07, Peter W A Wood via use-livecode wrote:
> Some Unicode characters, such as emojis, have to be represented by two
> codepoints in UTF-16 (known as surrogates) so they take four bytes not
> two. Additionally, characters with accents will take either one
> codepoint or two depending on whether they have been coded in
> pre-composed or decomposed form (e.g. ç can be either U+0063 U+0327
> (decomposed) or U+00E7 (precomposed)).
>
> So it isn’t easy to estimate the number of bytes in a UTF-16 string.
The number of bytes used by a string when encoded as UTF-16 is '2 * the
number of codeunits in tString'.
The number of codeunits in a string in LiveCode is a stored property of
the string, so doesn't require any computation. (We took the decision
that regardless of how a string is stored internally, it should always
be possible to ask for the number of codeunits in constant time, and to
be able to look up a codeunit in constant time).
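As a rough illustration of the '2 * codeunits' rule (a Python sketch, not LiveCode engine code; the function names are made up for this example): Python strings are indexed by codepoint, so the sketch has to scan to count codeunits, whereas LiveCode keeps that count stored with the string.

```python
def utf16_code_units(s: str) -> int:
    # Codepoints above U+FFFF need a surrogate pair (2 codeunits);
    # everything else fits in a single 16-bit codeunit.
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in s)


def utf16_byte_length(s: str) -> int:
    # Every UTF-16 codeunit occupies exactly two bytes.
    return 2 * utf16_code_units(s)


if __name__ == "__main__":
    for text in ["abc", "\u00e7", "c\u0327", "\U0001F600"]:
        # Cross-check against Python's own encoder (no BOM).
        assert utf16_byte_length(text) == len(text.encode("utf-16-le"))
        print(repr(text), utf16_code_units(text), utf16_byte_length(text))
```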
Note: codeunit is not the same as codepoint and codepoint is not the
same as character. Both codepoint and character require scanning the
string (in the general case), both to find the i'th one and to
compute the length.
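To make the three counts concrete, here is a small Python sketch (illustrative only, not how the engine does it). The 'character' count is a deliberately crude approximation that skips combining marks; real grapheme segmentation follows the Unicode text segmentation rules and handles many more cases. All function names are hypothetical.

```python
import unicodedata


def count_code_units(s: str) -> int:
    # UTF-16 codeunits: 2 for codepoints beyond the BMP, otherwise 1.
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in s)


def count_codepoints(s: str) -> int:
    # Python strings are sequences of codepoints.
    return len(s)


def count_characters_approx(s: str) -> int:
    # Crude approximation: treat each combining mark as part of the
    # preceding character. This is NOT full grapheme segmentation.
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


if __name__ == "__main__":
    precomposed = "\u00e7"                                  # U+00E7
    decomposed = unicodedata.normalize("NFD", precomposed)  # U+0063 U+0327
    emoji = "\U0001F600"                                    # surrogate pair in UTF-16
    for text in (precomposed, decomposed, emoji):
        print(repr(text), count_code_units(text),
              count_codepoints(text), count_characters_approx(text))
```

For the decomposed ç this gives 2 codeunits, 2 codepoints and 1 character; for the emoji it gives 2 codeunits, 1 codepoint and 1 character.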
In contrast (to UTF-16), if you want the number of bytes a string takes
up in UTF-8 encoding then you have to scan the string, as a codepoint
in UTF-8 can be 1-4 bytes in length.
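A sketch of that scan, again in Python and purely illustrative (the function name is invented for the example): the byte count is found by walking every codepoint and adding 1 to 4 bytes according to its range.

```python
def utf8_byte_length(s: str) -> int:
    total = 0
    for ch in s:
        cp = ord(ch)
        if cp < 0x80:        # ASCII
            total += 1
        elif cp < 0x800:     # e.g. most accented Latin letters
            total += 2
        elif cp < 0x10000:   # rest of the Basic Multilingual Plane
            total += 3
        else:                # supplementary planes, e.g. emoji
            total += 4
    return total


if __name__ == "__main__":
    for text in ["abc", "\u00e7", "\u20ac", "\U0001F600"]:
        # Cross-check against Python's own UTF-8 encoder.
        assert utf8_byte_length(text) == len(text.encode("utf-8"))
        print(repr(text), utf8_byte_length(text))
```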
> I would guess that LiveCode will store the characters of a string in
> single bytes if all the letters of the string conform to ISO-8859-1.
> So if you can be certain that your text is all ISO-8859-1 encoded, you
> can estimate at 1 byte per character. (The guess is based on the fact
> that the first 256 Unicode code points replicate ISO-8859-1).
Almost true - the engine stores strings which fit into the
running platform's 'legacy' (in terms of pre 7.0) encoding (ISO8859-1,
Latin-1, MacRoman) in that encoding in memory. This means that stacks
written pre-Unicode will use the same amount of memory, and the same
amount of processing time, as they did before.
The reason this works is because all three of those encodings have the
property that when they are converted to Unicode, the number of
codeunits in the Unicode version is the same as the number of codes
(indeed, bytes in this case) in the original string.
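That property can be checked with a short Python sketch. It uses Python's "latin-1" and "mac_roman" codecs as stand-ins for the engine's legacy encodings, which is an assumption of this example; the function names are likewise made up.

```python
LEGACY_CODECS = ("latin-1", "mac_roman")


def utf16_code_units(s: str) -> int:
    # 2 codeunits for codepoints beyond the BMP, otherwise 1.
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in s)


def every_byte_is_one_code_unit(codec: str) -> bool:
    # Decode each possible byte value on its own and check that it maps
    # to exactly one UTF-16 codeunit.
    for value in range(256):
        try:
            decoded = bytes([value]).decode(codec)
        except UnicodeDecodeError:
            continue  # byte value not assigned in this codec
        if utf16_code_units(decoded) != 1:
            return False
    return True


if __name__ == "__main__":
    for codec in LEGACY_CODECS:
        print(codec, every_byte_is_one_code_unit(codec))
    text = "fa\u00e7ade"
    for codec in LEGACY_CODECS:
        # Same count whether measured in legacy bytes or UTF-16 codeunits.
        assert len(text.encode(codec)) == utf16_code_units(text)
```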
Warmest Regards,
Mark.
--
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps