First 1000 characters without loop?

Mark Waddingham mark at livecode.com
Fri Jun 23 05:09:59 EDT 2017


On 2017-06-23 03:07, Peter W A Wood via use-livecode wrote:
> Some Unicode characters, such as emojis, have to be represented by two
> codepoints in UTF-16 (known as surrogates) so they take four bytes not
> two. Additionally, a character with an accent takes either one
> codepoint or two, depending on whether it has been coded in
> pre-composed or decomposed form (e.g. ç can be either U+0063 U+0327
> (decomposed) or U+00E7 (precomposed)).
> 
> So it isn’t easy to estimate the number of bytes in a UTF-16 string.

The number of bytes used by a string when encoded as UTF-16 is '2 * the 
number of codeunits in tString'.
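
As a rough illustration of that rule (in Python rather than LiveCode, 
and with hypothetical helper names), the sketch below counts codeunits 
by hand and checks the result against Python's own UTF-16 encoder. It 
assumes the string contains no lone surrogates.

```python
def utf16_code_units(text: str) -> int:
    """Count UTF-16 codeunits: one per BMP codepoint, two (a surrogate pair) otherwise."""
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text)


def utf16_byte_length(text: str) -> int:
    """Bytes needed for the string in UTF-16 (no byte-order mark): 2 * codeunits."""
    return 2 * utf16_code_units(text)


# Plain ASCII, precomposed c-cedilla, decomposed c-cedilla, and an emoji.
samples = ["abc", "\u00e7", "c\u0327", "\U0001F600"]
for sample in samples:
    # Cross-check against the real encoder ("utf-16-le" emits no BOM).
    assert utf16_byte_length(sample) == len(sample.encode("utf-16-le"))
```

Note that Python has to scan the string to count codeunits here; the 
point of the sketch is only the '2 * codeunits' arithmetic.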

The number of codeunits in a string in LiveCode is a stored property of 
the string, so doesn't require any computation. (We took the decision 
that regardless of how a string is stored internally, it should always 
be possible to ask for the number of codeunits in constant time, and to 
be able to look up a codeunit in constant time).

Note: codeunit is not the same as codepoint, and codepoint is not the 
same as character. For both codepoints and characters, finding the 
i'th one and computing the length require scanning the string (in the 
general case).
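
To make the three units concrete, here is a small Python sketch (not 
LiveCode; the function name is made up for illustration). The 
'character' count is only an approximation: it skips codepoints with a 
non-zero combining class, which is an assumption that falls well short 
of full Unicode grapheme segmentation.

```python
import unicodedata


def unit_counts(text: str) -> tuple[int, int, int]:
    """Return (codeunits, codepoints, approximate characters) for a string."""
    # UTF-16 codeunits: two bytes each, so halve the encoded length.
    code_units = len(text.encode("utf-16-le")) // 2
    # A Python str is a sequence of codepoints.
    code_points = len(text)
    # Approximation only: treat combining marks as part of the previous character.
    approx_chars = sum(1 for ch in text if unicodedata.combining(ch) == 0)
    return code_units, code_points, approx_chars


print(unit_counts("\u00e7"))      # precomposed c-cedilla
print(unit_counts("c\u0327"))     # decomposed c-cedilla
print(unit_counts("\U0001F600"))  # emoji outside the BMP
```

The decomposed form has two codeunits and two codepoints but one 
character; the emoji has two codeunits but one codepoint.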

In contrast to UTF-16, the number of bytes a string takes up in UTF-8 
encoding cannot be derived from a stored length: you have to scan the 
string, as a codepoint in UTF-8 can be 1-4 bytes in length.
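
A minimal Python sketch of that scan (again not LiveCode, and the 
function name is hypothetical) uses the standard UTF-8 size ranges and 
checks itself against Python's encoder. It assumes the string has no 
lone surrogates.

```python
def utf8_byte_length(text: str) -> int:
    """Sum the UTF-8 size of each codepoint; this needs a full pass over the string."""
    total = 0
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            total += 1   # ASCII
        elif cp < 0x800:
            total += 2   # e.g. U+00E7
        elif cp < 0x10000:
            total += 3   # rest of the BMP, e.g. U+20AC
        else:
            total += 4   # supplementary planes, e.g. emoji
    return total


for sample in ["abc", "\u00e7", "\u20ac", "\U0001F600"]:
    assert utf8_byte_length(sample) == len(sample.encode("utf-8"))
```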

> I would guess that LiveCode will store the characters of a string in
> single bytes if all the letters of the string conform to ISO-8859-1.
> So if you can be certain that your text is all ISO-8859-1 encoded, you
> can estimate at 1 byte per character. (The guess is based on the fact
> that the first 256 Unicode code points replicate ISO-8859-1).

Almost true - the engine stores strings which fit into the running 
platform's 'legacy' (in terms of pre 7.0) encoding (ISO8859-1, 
Latin-1, MacRoman) in that encoding in memory. This means that stacks 
written pre-Unicode will use the same amount of memory, and the same 
amount of processing time, as they did before.

The reason this works is because all three of those encodings have the 
property that when they are converted to Unicode, the number of 
codeunits in the Unicode version is the same as the number of codes 
(indeed, bytes in this case) in the original string.
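
That property can be checked mechanically. The Python sketch below (not 
engine code; the function name is invented) decodes every possible 
byte value with a single-byte codec and confirms each one becomes 
exactly one UTF-16 codeunit. It assumes Python's "latin-1" and 
"mac_roman" codecs are fair stand-ins for the encodings the engine 
uses.

```python
def one_code_unit_per_byte(codec: str) -> bool:
    """True if every byte value decodes to exactly one UTF-16 codeunit."""
    for value in range(256):
        decoded = bytes([value]).decode(codec)
        code_units = len(decoded.encode("utf-16-le")) // 2
        if code_units != 1:
            return False
    return True


for codec in ("latin-1", "mac_roman"):
    print(codec, one_code_unit_per_byte(codec))
```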

Warmest Regards,

Mark.

-- 
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps



