First 1000 characters without loop?
Rick Harrison
harrison at all-auctions.com
Fri Jun 23 10:33:03 EDT 2017
Hi Mark,
Thank you for your verbose answers to questions.
That’s really really deep stuff!
I’m so thankful that the engine takes care of all
of this stuff so that the rest of us don’t have to!
Cheers,
Rick
> On Jun 23, 2017, at 4:17 AM, Mark Waddingham via use-livecode <use-livecode at lists.runrev.com> wrote:
>
> On 2017-06-22 23:18, Richard Gaskin via use-livecode wrote:
>> With many chunk expressions, I would imagine it does. With line
>> chunks, for example, the engine needs to walk through the string,
>> comparing each character to CR, counting the found CRs as it goes.
>
> Yes - essentially that is the case. Technically it looks for LF rather than CR: currently, for better or for worse, the engine assumes LF is the line separator, and it normalizes line endings appropriately on a per-platform basis when you 'import' things as text into LiveCode.
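
The walk described above can be sketched in a few lines. This is Python rather than LiveCode, and `line_of` is a hypothetical helper, not the engine's actual routine. It only illustrates why fetching line N costs time proportional to the text that precedes it.

```python
def line_of(text: str, n: int) -> str:
    """Return line n (1-based) of text, treating LF as the separator."""
    if n < 1:
        raise ValueError("line numbers start at 1")
    start = 0
    current = 1
    # Walk forward one LF at a time until line n begins.
    # str.find is itself a character-by-character scan.
    while current < n:
        pos = text.find("\n", start)
        if pos == -1:
            return ""  # assumption: asking past the last line yields empty
        start = pos + 1
        current += 1
    end = text.find("\n", start)
    return text[start:] if end == -1 else text[start:end]
```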
>
>> In this case, though, I believe it doesn't need a loop per se, since
> >> AFAIK characters are fixed-size entities internally (Mark Waddingham,
> >> is it true that UTF-16 gives us two bytes per char across the
>> board?).
>
> No, this is not quite true - characters are not fixed-size entities from the computer's point of view. In LiveCode 'character' means 'grapheme' - which is roughly what humans consider to be characters in terms of writing and editing.
>
> Indeed, there are several concepts here:
>
> 1) character: a character is a sequence of Unicode codepoints
>
> 2) codepoint: a codepoint is the index into the Unicode code table (which has space for 1 million or so definitions)
>
> 3) codeunit: a codeunit is an index into the Basic Multilingual Plane (BMP) - the first 65536 Unicode codes. The BMP contains a block of codes called 'surrogates' which aren't actually codes in themselves, but allow two codeunits to be used to express a codepoint for any code at or above 65536 (0x10000).
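
The surrogate mechanism is plain arithmetic defined by UTF-16. Here is a minimal Python sketch of it; the function names `to_surrogates` and `from_surrogates` are made up for illustration and are not LiveCode or engine APIs.

```python
def to_surrogates(codepoint: int) -> tuple[int, int]:
    """Split a codepoint outside the BMP into a UTF-16 (high, low) pair."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("only codepoints outside the BMP need surrogates")
    offset = codepoint - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits go in the low surrogate
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into one codepoint."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For example, `to_surrogates(0x1F603)` gives `(0xD83D, 0xDE03)`, the two codeunits UTF-16 uses for that emoji.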
>
> Some examples:
>
> Character 'a':
>
> This is (as you might expect) always a single codepoint, and, indeed, always a single codeunit (in Unicode 'a' is encoded with the same code as it is in ASCII).
>
> Character 'a-acute':
>
> This can be represented as a single codepoint (and codeunit) 'a-acute' (the same code as a-acute has in the ISO-8859-1 encoding, a strict superset of ASCII).
>
> Or it can be represented as two codepoints 'a', 'combining-acute'. In both cases, these codepoints are in the BMP, so each codepoint is represented as a single codeunit.
>
> Character 'smiling face with open mouth emoji':
>
> This has code 0x1F603 - meaning it falls outside of the BMP (it is > 65535). It is a single codepoint, but requires two codeunits to encode.
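
The three examples above can be checked in any language that exposes both codepoints and UTF-16 codeunits. Below is a small Python sketch, not LiveCode; `counts` is a hypothetical helper. It relies on Python strings indexing by codepoint, so the codeunit count is taken from the UTF-16 byte length.

```python
import unicodedata


def counts(s: str) -> tuple[int, int]:
    """Return (codepoints, UTF-16 codeunits) for a string."""
    codepoints = len(s)                          # Python str counts codepoints
    codeunits = len(s.encode("utf-16-le")) // 2  # two bytes per UTF-16 codeunit
    return codepoints, codeunits


plain_a = "a"
composed = "\u00e1"                                  # a-acute as one codepoint
decomposed = unicodedata.normalize("NFD", composed)  # 'a' + combining acute
emoji = "\U0001F603"                                 # smiling face with open mouth

for label, s in [("a", plain_a), ("a-acute (composed)", composed),
                 ("a-acute (decomposed)", decomposed), ("emoji", emoji)]:
    print(label, counts(s))
```

The composed a-acute reports one codepoint and one codeunit, the decomposed form reports two of each, and the emoji reports one codepoint but two codeunits.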
>
> Some comparisons:
>
> ASCII, ISO-8859-1 (Latin-1) and MacRoman are all 'single-codepoint' encodings - all characters which those encodings can express are encoded as a single codepoint.
>
> Unicode is a 'multi-code' encoding - characters may require any number of codepoints to express. For example:
>
> - In Indic languages (which have a somewhat different structure than languages like English, French, German etc.), many codepoints are often needed to represent what humans might consider a 'character'.
>
> - You can stack any number of defined 'combining accents' onto a base character. You can have a character such as a-acute-underbar-ring-grave-cedilla-umlaut if you want.
>
> - Emoji codepoints can be followed by 'modifier' codepoints which allow customization of things like skin tone, and by 'variation selectors' which choose between text-style and emoji-style presentation.
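
To illustrate the stacked-accent case, here is a rough Python sketch that groups each base codepoint with the combining marks that follow it. `rough_characters` is a hypothetical helper and only an approximation: real grapheme segmentation follows the Unicode text segmentation rules (UAX #29), which also cover emoji sequences, Indic clusters and more, none of which this handles.

```python
import unicodedata


def rough_characters(s: str) -> list[str]:
    """Group base codepoints with trailing combining marks (approximation)."""
    clusters: list[str] = []
    for ch in s:
        # A non-zero combining class marks a combining accent.
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# 'a' + acute, low line, ring above, grave, cedilla, diaeresis
stacked = "a\u0301\u0332\u030a\u0300\u0327\u0308"
print(len(stacked), len(rough_characters(stacked)))
```

Here seven codepoints collapse into one user-perceived character, and finding where that character ends requires looking at every codepoint in it.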
>
> Basically, Unicode is a model for encoding writing systems with the aim that (over time) it can be used to represent *any* writing system which exists now or existed in the past. In order to do this in a tractable way (i.e. a way which could be implemented maintainably on modern systems) it uses an abstract model (sequences of codepoints which form characters). Due to this it can sometimes seem a little 'odd' but then it is trying to model things which were not designed to necessarily fit into a computer's viewpoint of the world - writing systems have evolved organically without thought on how a computer might need to process them.
>
> In LiveCode, you have access to 'character', 'codepoint' and 'codeunit' chunks. In general:
>
> - character access for general strings is never constant time, as characters can require multiple codepoints.
>
> - codepoint access for general strings is never constant time, as codepoints can require two codeunits to encode.
>
> - codeunit access for general strings is always constant time.
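
The difference in cost between the last two bullets can be sketched over a list of UTF-16 codeunits. This is Python, not the engine's implementation, and `utf16_units`, `codeunit_at` and `codepoint_at` are hypothetical names. The sketch assumes 1-based indexes to match LiveCode chunk numbering.

```python
def utf16_units(s: str) -> list[int]:
    """Encode a string as a list of 16-bit UTF-16 codeunits."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def codeunit_at(units: list[int], n: int) -> int:
    """Codeunit n (1-based): a single array lookup."""
    return units[n - 1]


def codepoint_at(units: list[int], n: int) -> int:
    """Codepoint n (1-based): must scan from the start, pairing surrogates."""
    i = 0
    count = 0
    while i < len(units):
        unit = units[i]
        is_pair = (0xD800 <= unit <= 0xDBFF and i + 1 < len(units)
                   and 0xDC00 <= units[i + 1] <= 0xDFFF)
        if is_pair:
            # High surrogate followed by low surrogate: one codepoint.
            value = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            step = 2
        else:
            value = unit
            step = 1
        count += 1
        if count == n:
            return value
        i += step
    raise IndexError("codepoint index out of range")
```

With the string 'a', the emoji 0x1F603, 'b', the third codeunit is the low surrogate 0xDE03, while the third codepoint is 'b', and finding it requires stepping over the surrogate pair first.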
>
> Internally, the engine will keep things which can be represented in the platform's native encoding as native as much as possible (the native encodings have the property that 1 character = 1 codepoint = 1 codeunit); otherwise it will (currently) store things internally as sequences of codeunits in the UTF-16 encoding. (How this might be done in future may well change in order to permit optimization, for example pure Greek or Russian text currently has a penalty compared to English text as it will always require UTF-16 internal encoding; however with the advent of Emoji and other such things, pure English text itself is becoming much less common).
>
> Most of the time 'character' is the most appropriate thing to use for reading strings, whilst codepoints can be used to build up strings of characters.
>
> The presence of 'codeunit' chunks is to allow optimization of critical routines in script as you can be sure that getting 'codeunit X of tString' is an array lookup (i.e. one step of computer processing, no loop needed).
>
> Warmest Regards,
>
> Mark.
>
> --
> Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
> LiveCode: Everyone can create apps
>
> _______________________________________________
> use-livecode mailing list
> use-livecode at lists.runrev.com
> Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
> http://lists.runrev.com/mailman/listinfo/use-livecode