How to find offsets in Unicode Text fast

Mark Waddingham mark at livecode.com
Tue Nov 13 02:07:53 EST 2018


On 2018-11-13 06:35, Geoff Canyon via use-livecode wrote:
> I didn't realize until now that offset() simply fails with some unicode
> strings:
> 
> put offset("a","↘𠜎qeiuruioqeaaa↘𠜎qeiuar",13) -- puts 0
> 
> On Mon, Nov 12, 2018 at 9:17 PM Geoff Canyon <gcanyon at gmail.com> wrote:
> 
>> A few things:
>> 
>> 1. It seems codepointOffset can only find a single character? So it
>> won't work for any search for a multi-character string?
>> 2: codepointOffset seems to work differently for multi-byte characters 
>> and
>> regular characters:
>> 
>> put codepointoffset("e","↘ndatestest",6) -- puts 3
>> put codepointoffset("e","andatestest",6) -- puts 9
>> 
>> 3: It seems that when multi-byte characters are involved, 
>> codepointOffset
>> suffers from the same sort of slow-down as offset does. For example, 
>> in a
>> 145K string with about 20K hits for a single character, a simple
>> codepointOffset routine (below) takes over 10 seconds, while the 
>> item-based
>> routine takes about 0.3 seconds for the same results.

There is something 'funky' going on with the *offset functions - even 
taking into account that codeunitOffset/codepointOffset/byteOffset 
return an absolute position rather than a relative one - I noticed 
something similar the other day, so I'll endeavour to take a look.

Regardless of any gremlins, the speed difference (although I've not seen 
the 'item-based' routine, so I don't quite know what it does) is due to 
the fact that comparing text correctly requires a lot of processing.

Unicode is a multi-code representation of text *regardless* of the 
encoding used: a single (what humans understand to be) character can be 
composed of multiple codes. For example, e-acute can be represented as 
[e-acute] or [e, combining-acute]. Indeed, Unicode allows an arbitrary 
sequence of combining marks to be attached to a base character 
(whether your text rendering system can deal with such things is another 
matter). Some languages rely entirely on multi-code sequences - e.g. the 
Indic languages; and more recently, emoji use various 'variation 
selectors' to produce variants of the base set of emoji - so they, too, 
are composed of multiple codes.
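
To make the multi-code point concrete, here is a minimal LiveCode sketch 
(assuming a 7.0+ engine, where 'codepoint' addresses Unicode code points 
and 'character' addresses graphemes - the variable names are just 
illustrative):

    put numToCodepoint(233) into tComposed                          -- U+00E9, e-acute
    put numToCodepoint(101) & numToCodepoint(769) into tDecomposed  -- U+0065 + U+0301
    put the number of codepoints in tComposed    -- 1
    put the number of codepoints in tDecomposed  -- 2
    put the number of characters in tDecomposed  -- 1 (characters count graphemes)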

Therefore, the only difference between UTF-8, UTF-16 and UTF-32 as a 
chosen representation is a trade-off in storage density which depends on 
the language being encoded - processing-wise they all have to do the 
same 'yoga' to work correctly when you are doing text processing (there 
is a byte-count sketch after the list):

   - UTF8: preserves the NUL byte (important for C systems), ASCII is 
1-byte per code, European/Roman languages are generally at most 2-bytes 
per code, every other language 3-4 bytes per code (i.e. it heavily 
penalises Asian languages).

   - UTF16: does not preserve the NUL byte (as it's a sequence of 16-bit 
unsigned integers), most languages in common use today (including Indic 
and CJK) are almost entirely encoded as 2-bytes per code, with some 
interspersion of 4-bytes per code (via the 'surrogate pair' 
mechanism).

   - UTF32: does not preserve the NUL byte (as it's a sequence of 32-bit 
unsigned integers), all languages require 4 bytes per code, and it wastes 
on average 15 bits per code (as almost all Unicode codes currently 
defined require a maximum of 17 bits).
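
As a rough illustration of the density point (the encodings themselves, 
not engine internals), textEncode can show the byte counts; the LE forms 
are used here to sidestep any byte-order-mark ambiguity, and the CJK 
sample is built from code points so the snippet stays ASCII-safe:

    put "hello" into tAscii
    put numToCodepoint(20320) & numToCodepoint(22909) into tCJK  -- U+4F60 U+597D
    put the number of bytes in textEncode(tAscii, "UTF-8")     -- 5
    put the number of bytes in textEncode(tAscii, "UTF-16LE")  -- 10
    put the number of bytes in textEncode(tCJK, "UTF-8")       -- 6 (3 bytes per code)
    put the number of bytes in textEncode(tCJK, "UTF-16LE")    -- 4 (2 bytes per code)
    put the number of bytes in textEncode(tCJK, "UTF-32LE")    -- 8 (4 bytes per code)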

Indeed, in a fully correct *general* implementation of Unicode 
processing, UTF32 will perform (on average) half as fast as an 
implementation based on UTF16, and UTF8 will only perform better than 
UTF16 *if* the majority of your texts are ASCII with occasional 
non-ASCII. (The latter is generally the case for European/Roman 
languages, but if you involve the Asian languages then it will, on 
average, be around 1/3rd slower.)

[ This observation is based on the fact that all Unicode text processing 
- if implemented in the most efficient way - will end up being memory 
bound on modern processors, and as such average speed will be determined 
by how many bytes the input and output strings take up. ]

So, the reason 'offset' appears slow is that it is doing a lot more work 
than you think. There are two principal processes which need to be 
applied to Unicode text in order to do any sort of comparison:

    - normalization
    - case-folding

Case-folding applies if you want to do caseless rather than 
case-sensitive comparison. It isn't *quite* upper-casing or 
lower-casing, but in fact a mapping from char->upper_char->lower_char so 
that differences in case are removed. Case-folding is a little more 
complex than just code->code mappings - for example, in full generality 
ß in German maps to SS. Also there are some hairy edge cases with 
'composed unicode characters' (particularly, but not exclusively, when 
considering polytonic Greek).
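
A one-line sketch of that char->upper_char->lower_char chain, assuming 
the input is internally Unicode (see the P.S. below about toUpper/toLower 
on native strings) - this is only an approximation of true Unicode case 
folding:

    -- approximate fold: under the full Unicode rules ß -> SS -> ss, so
    -- "straße" and "STRASSE" end up identical after this chain
    put toLower(toUpper(pText)) into tFolded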

Normalization is the process of ensuring that two sequences of codes are 
mapped to canonically equivalent sequences of codes - it is *mostly* 
orthogonal to case-folding (at least if you normalize to the decomposed 
form). Basically, normalization discards the difference between 
sequences such as [e,combining-acute] and [e-acute]... However, again it 
isn't quite as simple as directly mapping one sequence to another, as 
combining marks have a priority which means their order after the base 
character doesn't matter. Similarly, you have things like the Hangul 
encoding of Korean syllables - which compose and decompose in a 
specialized way (best done algorithmically, as it's generally faster 
than doing it with a lookup table, which would be huge).
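
A minimal sketch of what normalization buys you, assuming a 7.0+ engine 
(with formSensitive true the engine compares the raw code sequences):

    set the formSensitive to true
    put numToCodepoint(233) into tComposed                          -- [e-acute]
    put numToCodepoint(101) & numToCodepoint(769) into tDecomposed  -- [e, combining-acute]
    put tComposed is tDecomposed   -- false: different code sequences
    put normalizeText(tComposed, "nfd") is normalizeText(tDecomposed, "nfd")  -- true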

Anyway, the upshot is that offset has to do a lot of work when done in 
isolation, as the engine doesn't know you are repeatedly using it. 
However, you can reduce its workload significantly by normalizing and 
case-folding the inputs to it first, and then setting caseSensitive and 
formSensitive to true:

    put normalizeText(toUpper(pNeedle), "nfd") into pNeedle
    put normalizeText(toUpper(pHaystack), "nfd") into pHaystack
    set the caseSensitive to true
    set the formSensitive to true

Here the caseSensitive property controls whether the engine case-folds 
strings before comparison, and formSensitive controls whether the engine 
normalizes strings before comparison. They can be set independently 
(although I must confess I'd have to double check the utility of 
caseSensitive false but formSensitive true - for reasons alluded to 
above!).

Note here I've used "NFD" which is the fully decomposed normalization 
form (which means no composed characters such as e-acute will appear), 
and upper-casing as folding - which should cover both the issues with 
doing caseless comparison on composed sequences, as well as edge cases 
like ß. There are a couple of engine additions which would be useful 
here - one giving direct access to the case-folding Unicode defines, and 
one returning a string pre-processed according to the settings of 
caseSensitive/formSensitive.

With the above (modulo bugs in offset and friends - which appear to be 
rearing their heads based on Geoff's comments), then offset should run 
much quicker.
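
For completeness, here is a minimal sketch of wrapping that preparation 
around a repeated-offset loop (the handler name and comma-delimited 
result are just illustrative; note the positions refer to the 
normalized, folded copy, whose length can differ from the original):

    function findAllOffsets pNeedle, pHaystack
       -- fold and normalize once up front, so offset doesn't redo it per call
       put normalizeText(toUpper(pNeedle), "nfd") into pNeedle
       put normalizeText(toUpper(pHaystack), "nfd") into pHaystack
       set the caseSensitive to true
       set the formSensitive to true

       local tOffsets, tSkip, tRelative
       put 0 into tSkip
       repeat forever
          -- offset returns a position relative to the skipped chars, or 0 if not found
          put offset(pNeedle, pHaystack, tSkip) into tRelative
          if tRelative = 0 then exit repeat
          put tSkip + tRelative & comma after tOffsets
          put tSkip + tRelative into tSkip
       end repeat
       if tOffsets is not empty then delete the last char of tOffsets
       return tOffsets
    end findAllOffsets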

Warmest Regards,

Mark.

P.S. I did notice an issue with toUpper/toLower recently - when the 
input string is Unicode then it applies the 'full' Unicode rules, 
including the German ß mapping; if, however, the input is native then it 
does not. For backwards compatibility's sake (and the fact that most 
people don't expect uppercasing/lowercasing to change the number of 
characters in a string), it should probably only use simple rules - 
unless you explicitly ask it to do otherwise. Monte did some work last 
week to start adding a second 'locale' parameter to control the precise 
(complex) mappings (casing behavior is actually defined by 
language/region, not by character - e.g. Turkish and Azeri both have a 
dotted and a dotless I, and there is variation as to what happens when 
one or the other is uppercased, based on what precise language is being 
represented).

-- 
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps



