How to find offsets in Unicode Text fast
Mark Waddingham
mark at livecode.com
Tue Nov 13 02:07:53 EST 2018
On 2018-11-13 06:35, Geoff Canyon via use-livecode wrote:
> I didn't realize until now that offset() simply fails with some unicode
> strings:
>
> put offset("a","↘𠜎qeiuruioqeaaa↘𠜎qeiuar",13) -- puts 0
>
> On Mon, Nov 12, 2018 at 9:17 PM Geoff Canyon <gcanyon at gmail.com> wrote:
>
>> A few things:
>>
>> 1. It seems codepointOffset can only find a single character? So it
>> won't work for any search for a multi-character string?
>> 2: codepointOffset seems to work differently for multi-byte characters
>> and
>> regular characters:
>>
>> put codepointoffset("e","↘ndatestest",6) -- puts 3
>> put codepointoffset("e","andatestest",6) -- puts 9
>>
>> 3: It seems that when multi-byte characters are involved,
>> codepointOffset
>> suffers from the same sort of slow-down as offset does. For example,
>> in a
>> 145K string with about 20K hits for a single character, a simple
>> codepointOffset routine (below) takes over 10 seconds, while the
>> item-based
>> routine takes about 0.3 seconds for the same results.
There is something 'funky' going on with the *offset functions - even taking into account that codeunitOffset/codepointOffset/byteOffset return an absolute position rather than a relative one - I noticed something similar the other day, so I'll endeavour to take a look.
Regardless of any gremlins, the speed difference (although I've not seen the 'item-based' routine, so I don't quite know what that is) is due to the fact that in order to compare text you need to do a lot of processing.
Unicode is a multi-code representation of text *regardless* of the encoding chosen: a single (what humans understand to be) character can be composed of multiple codes. For example, e-acute can be represented as [e-acute] or [e, combining-acute]. Indeed, Unicode allows an arbitrary sequence of combining marks to be attached to a base character (whether your text rendering system can deal with such things is another matter). Some languages rely entirely on multi-code sequences - e.g. the Indic languages - and, more recently, emoji use various 'variation selectors' to allow variation in the base set of emoji, so they too are composed of multiple codes.
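For example, a quick sketch (untested, but using only 7.0+ syntax - the numToCodepoint function and the codepoint chunk) showing that the two forms of e-acute render the same yet contain different numbers of codes:

-- e-acute built in its two Unicode forms
put numToCodepoint(233) into tComposed             -- U+00E9: e-acute as a single code
put "e" & numToCodepoint(769) into tDecomposed     -- U+0065 + U+0301: e plus combining acute
put the number of codepoints in tComposed          -- 1
put the number of codepoints in tDecomposed        -- 2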
Therefore, the only difference between UTF-8, UTF-16 and UTF-32 as a chosen representation is a trade-off between density and storage requirements, which varies by language - processing-wise they all have to do the same 'yoga' to 'work' correctly when you are doing text processing (there's a short sketch after the list showing the difference in practice):
- UTF-8: preserves the NUL byte (important for C systems); ASCII is 1 byte per code, European/Roman languages are generally at most 2 bytes per code, and every other language is 3-4 bytes per code (i.e. it heavily penalises Asian languages).
- UTF-16: does not preserve the NUL byte (as it's a sequence of 16-bit unsigned integers); most languages in common use today (including Indic and CJK) are almost entirely encoded as 2 bytes per code, with some interspersion of 4-byte codes (via the 'surrogate pair' mechanism).
- UTF-32: does not preserve the NUL byte (as it's a sequence of 32-bit unsigned integers); all languages require 4 bytes per code, and it wastes on average 15 bits per code (as almost all Unicode codes currently defined require a maximum of 17 bits).
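As a quick way of seeing the density trade-off from script, you can compare the byte counts of the three encodings (a sketch, reusing Geoff's test string; textEncode is 7.0+):

put "↘𠜎qeiuruioqeaaa↘𠜎qeiuar" into tString
put the number of bytes in textEncode(tString, "UTF-8")    -- 1 byte for ASCII, 3-4 for the rest
put the number of bytes in textEncode(tString, "UTF-16")   -- 2 bytes per code, 4 for each 𠜎
put the number of bytes in textEncode(tString, "UTF-32")   -- always 4 bytes per code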
Indeed, in a fully correct *general* implementation of Unicode processing, UTF-32 will perform (on average) half as fast as an implementation based on UTF-16, and UTF-8 will only perform better than UTF-16 *if* the majority of your texts are ASCII with occasional non-ASCII. (The latter is generally the case for European/Roman languages, but if you involve the Asian languages then UTF-8 will, on average, be around a third slower.)
[ This observation is based on the fact that all Unicode text processing - if implemented in the most efficient way - will end up being memory bound on modern processors, and as such average speed will be determined by how many bytes the input and output strings take up. ]
So, the reason 'offset' appears slow is that it is doing a lot more work
than you think. There are two principal processes which need to be
applied to Unicode text in order to do any sort of comparison:
- normalization
- case-folding
Case-folding applies if you want to do caseless rather than
case-sensitive comparison. It isn't *quite* upper-casing or
lower-casing, but in fact a mapping from char->upper_char->lower_char so
that differences in case are removed. Case-folding is a little more
complex than just code->code mappings - for example ß in German maps to
SS in full generality. Also there are some hairy edge cases with
'composed Unicode characters' (particularly, but not exclusively, when considering polytonic Greek).
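This folding is what the caseSensitive property switches on and off for comparisons - a minimal illustration using offset:

set the caseSensitive to true
put offset("E", "abcdef")    -- 0: no folding, so 'E' and 'e' differ
set the caseSensitive to false
put offset("E", "abcdef")    -- 5: both sides are folded before comparing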
Normalization is the process of ensuring that two sequences of codes are
mapped to canonically equivalent sequences of codes - it is *mostly*
orthogonal to case-folding (at least if you normalize to the decomposed
form). Basically the process of normalization discards the difference
between sequences such as [e,combining-acute] and [e-acute]... However,
again it isn't quite as simple as a direct sequence-to-sequence mapping, as combining marks have priorities which mean their order after the base character doesn't matter. Similarly, you have things like the Hangul syllables used for Korean - which compose to and decompose from jamo in a specialized way (best done algorithmically, as it's generally faster than a lookup table, which would be huge).
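From script you can see (and control) normalization directly via normalizeText - a small sketch, again assuming 7.0+ syntax:

-- NFC composes, NFD decomposes
put numToCodepoint(233) into tComposed                   -- e-acute as a single code
put normalizeText(tComposed, "NFD") into tDecomposed     -- now e plus combining acute
put the number of codepoints in tDecomposed              -- 2
put (normalizeText(tDecomposed, "NFC") = tComposed)      -- true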
Anyway, the upshot is that offset has to do a lot of work when used in isolation, as the engine doesn't know you are calling it repeatedly. However, you can reduce its workload significantly by normalizing and case-folding the inputs to it first, and then setting caseSensitive and formSensitive to true:
put normalizeText(toUpper(pNeedle), "nfd") into pNeedle
put normalizeText(toUpper(pHaystack), "nfd") into pHaystack
set the caseSensitive to true
set the formSensitive to true
Here the caseSensitive property controls whether the engine case-folds strings before comparison (it folds when caseSensitive is false), and formSensitive controls whether it normalizes strings before comparison (it normalizes when formSensitive is false). They can be set independently (although I must confess I'd have to double check the utility of caseSensitive false but formSensitive true - for reasons alluded to above!).
Note here I've used "NFD", which is the fully decomposed normalization form (meaning no composed characters such as e-acute will appear), and upper-casing as the folding - which should cover both the issues with doing caseless comparison on composed sequences and edge cases like ß. There are a couple of engine additions which would be useful here - one giving direct access to the case-folding Unicode defines, and one returning a string pre-processed according to the current settings of caseSensitive/formSensitive.
With the above (modulo bugs in offset and friends - which appear to be rearing their heads based on Geoff's comments), offset should run much more quickly.
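To make that concrete, here's a rough sketch of a 'find every offset' handler along those lines (untested, the handler name is purely illustrative, and it is of course subject to the offset bugs mentioned above). It pre-processes both strings once, then loops using offset's chars-to-skip parameter, accumulating absolute positions:

-- Collect every offset of pNeedle in pHaystack, folding and normalizing
-- once up front so the engine doesn't redo that work on every call
function allOffsets pNeedle, pHaystack
   local tOffsets, tSkip, tOffset
   put normalizeText(toUpper(pNeedle), "nfd") into pNeedle
   put normalizeText(toUpper(pHaystack), "nfd") into pHaystack
   set the caseSensitive to true
   set the formSensitive to true
   put 0 into tSkip
   repeat forever
      put offset(pNeedle, pHaystack, tSkip) into tOffset
      if tOffset is 0 then exit repeat
      if tOffsets is not empty then put comma after tOffsets
      put tSkip + tOffset after tOffsets      -- offset is relative to the skipped chars
      put tSkip + tOffset into tSkip
   end repeat
   return tOffsets
end function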
Warmest Regards,
Mark.
P.S. I did notice an issue with toUpper/toLower recently - when the
input string is Unicode then it applies the 'full' Unicode rules,
including the German ß mapping; if however the input is native then it
does not. For backwards compatibility's sake (and the fact that most
people don't expect uppercasing/lowercasing to change the number of
characters in a string), it should probably only use simple rules -
unless you explicitly ask it to. Monte did some work last week to start
adding a second 'locale' parameter to control the precise (complex)
mappings (casing behavior is actually defined by language/region - not
by character - e.g. Turkish/Azeri both have a dotted and dotless I - and
there is variation as to what happens when one or other is uppercased,
based on what precise language is being represented).
--
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps