First 1000 characters without loop?

Peter W A Wood peterwawood at gmail.com
Thu Jun 22 21:07:21 EDT 2017


Richard

> How can we know which is in use for a given string?
> 
> Suppose I wanted to process a lot of text, so performance is critical. Using bytes would be optimal, since any chunk type or even Unicode characters may vary in length.
> 
> So if I wanted to create an index of byte offsets into a large chunk of text, how would I know how long a character is?

Some Unicode characters, such as most emojis, lie outside the Basic Multilingual Plane and have to be represented in UTF-16 by two 16-bit code units (known as a surrogate pair), so they take four bytes, not two. Additionally, characters with accents take either one code point or two depending on whether they have been coded in precomposed or decomposed form (e.g. ç can be either U+0063 U+0327 (decomposed) or U+00E7 (precomposed)).

So it isn’t easy to estimate the number of bytes in a UTF-16 string.
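
To make the byte counts above concrete, here is a minimal sketch in Python (not LiveCode). It only illustrates how UTF-16 itself behaves; the helper name utf16_byte_length is made up for this example, and nothing here says how LiveCode stores strings internally.

```python
import unicodedata

def utf16_byte_length(text):
    """Bytes needed to store text as UTF-16 (little-endian, no byte order mark)."""
    return len(text.encode("utf-16-le"))

precomposed = "\u00e7"                                  # ç as the single code point U+00E7
decomposed = unicodedata.normalize("NFD", precomposed)  # c (U+0063) followed by U+0327
emoji = "\U0001F600"                                    # outside the BMP, needs a surrogate pair

print(utf16_byte_length("c"))          # 2
print(utf16_byte_length(precomposed))  # 2
print(utf16_byte_length(decomposed))   # 4
print(utf16_byte_length(emoji))        # 4
```

The precomposed and decomposed forms look identical on screen but differ by two bytes, which is why a fixed bytes-per-character estimate breaks down.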

I would guess that LiveCode will store the characters of a string in single bytes if all the characters of the string conform to ISO-8859-1. So if you can be certain that your text is all ISO-8859-1 encoded, you can estimate 1 byte per character. (The guess is based on the fact that the first 256 Unicode code points replicate ISO-8859-1.)
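
As a rough illustration of that guess, the following Python sketch builds an index of byte offsets under the assumed storage scheme: one byte per character when every code point is below U+0100, otherwise UTF-16. The function names fits_latin1 and byte_offsets are hypothetical, and the storage rule is only the guess described above, not confirmed LiveCode behaviour.

```python
from itertools import accumulate

def fits_latin1(text):
    """True if every code point is below U+0100, i.e. representable in ISO-8859-1."""
    return all(ord(ch) < 0x100 for ch in text)

def byte_offsets(text):
    """Byte offset of each code point under the ASSUMED storage scheme."""
    if fits_latin1(text):
        widths = [1] * len(text)          # assumed: single-byte storage
    else:
        # assumed: UTF-16, where code points above U+FFFF need a surrogate pair
        widths = [4 if ord(ch) > 0xFFFF else 2 for ch in text]
    # running total of widths, shifted so the first offset is 0
    return [0, *accumulate(widths)][:-1]

print(byte_offsets("abc"))            # [0, 1, 2]
print(byte_offsets("a\U0001F600b"))   # [0, 2, 6]
```

Note that the offsets are per code point, so a decomposed ç would occupy two entries in the index rather than one.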

Regards

Peter
