First 1000 characters without loop?

Mark Waddingham mark at livecode.com
Fri Jun 23 04:17:13 EDT 2017


On 2017-06-22 23:18, Richard Gaskin via use-livecode wrote:
> With many chunk expressions, I would imagine it does.  With line
> chunks, for example, the engine needs to walk through the string,
> comparing each character to CR, counting the found CRs as it goes.

Yes - essentially that is the case (although technically it looks for 
LF, not CR: currently, for better or for worse, the engine assumes that 
'line' means LF-separated, and normalizes line endings appropriately 
on a per-platform basis when you 'import' things as text into 
LiveCode).

> In this case, though, I believe it doesn't need a loop per se, since
> AFAIK character are fixed-size entities internally (Mark Waddingham,
> is that true that UTF-16 gives us two-bytes per char across the
> board?).

No, this is not quite true - characters are not fixed-size entities from 
the computer's point of view. In LiveCode 'character' means 'grapheme' - 
which is roughly what humans consider to be characters in terms of 
writing and editing.

Indeed, there are several concepts here:

   1) character: a character is a sequence of Unicode codepoints

   2) codepoint: a codepoint is the index into the Unicode code table 
(which has space for 1 million or so definitions)

   3) codeunit: a codeunit is an index into the Basic Multilingual Plane 
(BMP) - the first 65536 Unicode codes. The BMP contains a block of codes 
called 'surrogates' which aren't actually codes in themselves, but allow 
two codeunits to be used to express a codepoint for any code defined 
above 65535.
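
As a rough illustration of how surrogates work, here is a minimal Python 
sketch of the standard UTF-16 rule for turning one codepoint into one or 
two codeunits. The function name to_utf16_codeunits is made up for this 
example; it is not a LiveCode function and says nothing about how the 
engine implements this. It also assumes the input is not itself a 
surrogate code.

```python
def to_utf16_codeunits(codepoint):
    """Return the UTF-16 codeunits (as a list of ints) for one codepoint.

    Illustrative helper only; assumes 'codepoint' is a valid,
    non-surrogate Unicode code.
    """
    if codepoint < 0x10000:
        # Inside the BMP: the codepoint is its own single codeunit.
        return [codepoint]
    # Outside the BMP: split the 20-bit offset across a surrogate pair.
    offset = codepoint - 0x10000
    high = 0xD800 + (offset >> 10)    # leading (high) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # trailing (low) surrogate
    return [high, low]


print([hex(u) for u in to_utf16_codeunits(0x61)])     # ['0x61']
print([hex(u) for u in to_utf16_codeunits(0x1F603)])  # ['0xd83d', '0xde03']
```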

Some examples:

Character 'a':

This is (as you might expect) always a single codepoint, and, indeed, 
always a single codeunit (in Unicode 'a' is encoded with the same code 
as it is in ASCII).

Character 'a-acute':

This can be represented either as a single codepoint (and codeunit) 
'a-acute' (the same code as a-acute has in the ISO-8859-1 encoding, a 
strict superset of ASCII).

Or it can be represented as two codepoints 'a', 'combining-acute'. In 
both cases, these codepoints are in the BMP, so each codepoint is 
represented as a single codeunit.

Character 'smiling face with open mouth emoji':

This has code 0x1F603 - meaning it falls outside of the BMP (it is > 
65535). It is a single codepoint, but requires two codeunits to encode.
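
The three examples above can be checked with a short Python sketch. 
Python is used here only as a neutral way to count things: its strings 
are sequences of codepoints, and encoding to UTF-16 gives the codeunit 
count. The helper names are invented for this example.

```python
import unicodedata


def count_codepoints(s):
    # Python strings are sequences of codepoints, so len() counts them.
    return len(s)


def count_codeunits(s):
    # Each UTF-16 codeunit is two bytes; 'utf-16-le' adds no byte order mark.
    return len(s.encode("utf-16-le")) // 2


a = "a"
a_acute_composed = "\u00e1"      # single codepoint a-acute
a_acute_decomposed = "a\u0301"   # 'a' followed by combining acute
emoji = "\U0001F603"             # smiling face with open mouth

for label, s in [("a", a),
                 ("a-acute (composed)", a_acute_composed),
                 ("a-acute (decomposed)", a_acute_decomposed),
                 ("emoji", emoji)]:
    print(label, count_codepoints(s), count_codeunits(s))
# a 1 1
# a-acute (composed) 1 1
# a-acute (decomposed) 2 2
# emoji 1 2

# The two spellings of a-acute are the same character; Unicode
# normalization converts between them.
print(unicodedata.normalize("NFC", a_acute_decomposed) == a_acute_composed)  # True
```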

Some comparisons:

ASCII, ISO-8859-1 (Latin-1) and MacRoman are all 'single-codepoint' 
encodings - all characters which those encodings can express are encoded 
as a single codepoint.

Unicode is a 'multi-code' encoding - characters may require any number 
of codepoints to express. For example:

   - In Indic languages (which have a somewhat different structure than 
languages like English, French, German etc.), many codepoints are often 
needed to represent what humans might consider a 'character'.

   - You can stack any number of defined 'combining accents' onto a base 
character. You can have a character such as 
a-acute-underbar-ring-grave-cedilla-umlaut if you want.

   - Emoji codepoints can be followed by 'modifiers' which allow 
customization of things like skin tone, and by 'variation selectors' 
which choose between text-style and emoji-style presentation.
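
To make the combining-accent case concrete, here is a small Python 
sketch that builds the stacked character mentioned above and counts its 
codepoints. The function rough_character_count is a deliberately 
simplified stand-in invented for this example: it only folds combining 
marks into the preceding base codepoint. Real grapheme segmentation (as 
used for 'character' chunks) follows more involved rules and also 
handles things like Indic clusters and emoji sequences, which this 
sketch does not.

```python
import unicodedata


def rough_character_count(s):
    """Count codepoints that are not combining marks.

    Simplified approximation: each combining mark (non-zero canonical
    combining class) is treated as part of the previous character.
    """
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


# a-acute-underbar-ring-grave-cedilla-umlaut, built from a base letter
# plus six combining marks.
stacked = ("a"
           "\u0301"   # combining acute accent
           "\u0332"   # combining low line (underbar)
           "\u030a"   # combining ring above
           "\u0300"   # combining grave accent
           "\u0327"   # combining cedilla
           "\u0308")  # combining diaeresis (umlaut)

print(len(stacked))                    # 7 codepoints
print(rough_character_count(stacked))  # 1 character (by this rough rule)
```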

Basically, Unicode is a model for encoding writing systems with the aim 
that (over time) it can be used to represent *any* writing system which 
exists now or existed in the past. In order to do this in a tractable 
way (i.e. a way which could be implemented maintainably on modern 
systems) it uses an abstract model (sequences of codepoints which form 
characters). Due to this it can sometimes seem a little 'odd' but then 
it is trying to model things which were not designed to necessarily fit 
into a computer's viewpoint of the world - writing systems have evolved 
organically without thought on how a computer might need to process 
them.

In terms of LiveCode, you have access to 'character', 'codepoint' 
and 'codeunit' chunks. In general:

    - character access for general strings is never constant time, as 
characters can require multiple codepoints.

    - codepoint access for general strings is never constant time, as 
codepoints can require two codeunits to encode.

    - codeunit access for general strings is always constant time.
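
The difference between the last two cases can be sketched in Python by 
modelling a string as a list of UTF-16 codeunits. This is only an 
illustration of the general idea, assuming well-formed UTF-16; the 
helper names are invented and the engine's real code is not shown here. 
Indexes are 1-based to match LiveCode chunk numbering.

```python
def to_codeunits(s):
    """Model a string as a list of 16-bit UTF-16 codeunits."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def codeunit_at(units, n):
    # Direct array lookup: one step, regardless of n.
    return units[n - 1]


def codepoint_at(units, n):
    # Must scan from the start, because any earlier codepoint may have
    # taken two codeunits (a surrogate pair) rather than one.
    i = 0
    seen = 0
    while i < len(units):
        u = units[i]
        is_pair = (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
                   and 0xDC00 <= units[i + 1] <= 0xDFFF)
        if is_pair:
            value = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            width = 2
        else:
            value = u
            width = 1
        seen += 1
        if seen == n:
            return value
        i += width
    raise IndexError("codepoint index out of range")


units = to_codeunits("a\U0001F603b")   # 'a', emoji, 'b'
print([hex(u) for u in units])         # ['0x61', '0xd83d', '0xde03', '0x62']
print(hex(codeunit_at(units, 3)))      # 0xde03 (second half of the emoji)
print(hex(codepoint_at(units, 2)))     # 0x1f603
print(hex(codepoint_at(units, 3)))     # 0x62 ('b')
```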

Internally, the engine will keep things which can be represented in the 
platform's native encoding as native as much as possible (the native 
encodings have the property that 1 character = 1 codepoint = 1 
codeunit); otherwise it will (currently) store things internally as 
sequences of codeunits in the UTF-16 encoding. (How this might be done 
in future may well change in order to permit optimization, for example 
pure Greek or Russian text currently has a penalty compared to English 
text as it will always require UTF-16 internal encoding; however with 
the advent of Emoji and other such things, pure English text itself is 
becoming much less common).
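
The 'native where possible, otherwise UTF-16' idea can be sketched in 
Python as follows. This is an assumption-laden illustration, not the 
engine's actual logic: the function choose_storage is hypothetical, and 
Latin-1 merely stands in for whatever the platform's native encoding is.

```python
def choose_storage(s, native_encoding="latin-1"):
    """Return (encoding name, encoded bytes) for a string.

    Hypothetical sketch: try the single-byte native encoding first and
    fall back to UTF-16 only when some character cannot be represented.
    """
    try:
        return native_encoding, s.encode(native_encoding)
    except UnicodeEncodeError:
        return "utf-16-le", s.encode("utf-16-le")


for text in ["hello", "\u041f\u0440\u0438\u0432\u0435\u0442", "a\U0001F603"]:
    encoding, data = choose_storage(text)
    print(encoding, len(data))
# latin-1 5      (English text: one byte per character)
# utf-16-le 12   (six Cyrillic letters: two bytes each)
# utf-16-le 6    ('a' plus an emoji needing a surrogate pair)
```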

Most of the time 'character' is the most appropriate thing to use for 
reading strings, whilst codepoints can be used to build up strings of 
characters.

The presence of 'codeunit' chunks is to allow optimization of critical 
routines in script as you can be sure that getting 'codeunit X of 
tString' is an array lookup (i.e. one step of computer processing, no 
loop needed).

Warmest Regards,

Mark.

-- 
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps



