How to find offsets in Unicode Text fast

Mark Waddingham mark at livecode.com
Tue Nov 13 06:02:08 EST 2018


On 2018-11-13 11:37, Geoff Canyon via use-livecode wrote:
> I understand (generally) the complexity of comparison, but that's not
> the speed issue causing this discussion. Most of the proposed
> solutions are using nearly the same operators/functions for
> comparison, or at least the same comparison is being done. Instead,
> the problem is a Foolish Line Painter problem: with single-byte
> characters, finding all occurrences of one string in another by
> repeatedly using offset() with charsToSkip scales well; but with
> multi-byte characters, the penalty for repeatedly calculating out
> longer and longer skips is exponential.

The actual reason it's not linear when you have Unicode strings (the
behavior isn't exponential, btw - cubic, I think) is that working out a
character index in a Unicode string is an O(n) operation - in native
strings it's O(1).
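
To make that concrete, here is a rough sketch (untested, handler name
is just illustrative) of the pattern being discussed - collecting every
match by calling offset() with a growing charsToSkip. Each call has to
turn the accumulated skip count back into an internal index, which is
O(n) for Unicode text, so the loop as a whole degrades badly as the
skip grows:

function allOffsetsSlow pNeedle, pHaystack
   local tSkip, tOffset, tOffsets
   put 0 into tSkip
   repeat forever
      -- each call has to re-derive the internal index for tSkip,
      -- which is O(n) in a Unicode string
      put offset(pNeedle, pHaystack, tSkip) into tOffset
      if tOffset = 0 then exit repeat
      add tOffset to tSkip
      put tSkip & comma after tOffsets
   end repeat
   if tOffsets is not empty then delete the last char of tOffsets
   return tOffsets
end allOffsetsSlow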

So, I revise what I said before: you actually need to use
codeunitOffset(), with formSensitive and caseSensitive set to true, and
with your input strings appropriately processed (i.e. both normalized
to the same form, and case-folded if you want a caseless search).
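
A rough sketch of what I mean (untested; the handler name and the
choice of NFC are just for illustration - any single normal form will
do as long as both strings get the same treatment, and you'd lowercase
both as well for a caseless search):

function allCodeunitOffsets pNeedle, pHaystack
   local tNeedle, tHaystack, tSkip, tOffset, tOffsets
   -- pre-process the inputs once, rather than per comparison
   put normalizeText(pNeedle, "NFC") into tNeedle
   put normalizeText(pHaystack, "NFC") into tHaystack
   -- exact codeunit comparison - no normalization or case mapping
   -- happens inside the search loop
   set the caseSensitive to true
   set the formSensitive to true
   put 0 into tSkip
   repeat forever
      put codeunitOffset(tNeedle, tHaystack, tSkip) into tOffset
      if tOffset = 0 then exit repeat
      add tOffset to tSkip
      put tSkip & comma after tOffsets
   end repeat
   if tOffsets is not empty then delete the last char of tOffsets
   -- note: these are codeunit offsets into the normalized haystack,
   -- not character offsets
   return tOffsets
end allCodeunitOffsets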

The only wrinkle with that is that you might get 'false-positive'
matches - i.e. where a match is not on character boundaries.

e.g. [e, combining-acute] will be found in [e, combining-acute,
combining-grave] even though the latter is *not* the former as
characters (but is as codeunits).
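
For instance, something like this (untested) should show the
difference - the codeunit search reports a hit at 1, while the
character-level search reports no match at all:

on mouseUp
   local tNeedle, tHaystack
   set the caseSensitive to true
   set the formSensitive to true
   put "e" & numToCodepoint(0x0301) into tNeedle -- <e, combining-acute>
   put "e" & numToCodepoint(0x0301) & numToCodepoint(0x0300) \
         into tHaystack -- <e, combining-acute, combining-grave>
   -- codeunit search finds a match, character search does not
   put codeunitOffset(tNeedle, tHaystack) && offset(tNeedle, tHaystack) -- "1 0"
end mouseUp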

Using textEncode("UTF-32") / binary offset will cause more 
false-positives than that
(as you've mentioned before) as it could find matches which are not on a 
4 byte boundary.
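
If you did want to go that route, you'd have to filter those hits
yourself - a rough sketch (untested, handler name is just illustrative)
that keeps only matches starting on a 4-byte boundary:

function allUTF32Offsets pNeedle, pHaystack
   local tNeedleBytes, tHaystackBytes, tSkip, tOffset, tOffsets
   put textEncode(pNeedle, "UTF-32") into tNeedleBytes
   put textEncode(pHaystack, "UTF-32") into tHaystackBytes
   put 0 into tSkip
   repeat forever
      put byteOffset(tNeedleBytes, tHaystackBytes, tSkip) into tOffset
      if tOffset = 0 then exit repeat
      add tOffset to tSkip
      -- only keep matches starting on a 4-byte (codepoint) boundary;
      -- the character-boundary false positives described above still
      -- apply even to these
      if (tSkip - 1) mod 4 = 0 then
         put (((tSkip - 1) div 4) + 1) & comma after tOffsets -- codepoint index
      end if
   end repeat
   if tOffsets is not empty then delete the last char of tOffsets
   return tOffsets
end allUTF32Offsets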

Warmest Regards,

Mark.

-- 
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps



