How to find offsets in Unicode Text fast
Mark Waddingham
mark at livecode.com
Tue Nov 13 06:02:08 EST 2018
On 2018-11-13 11:37, Geoff Canyon via use-livecode wrote:
> I understand (generally) the complexity of comparison, but that's not
> the speed issue causing this discussion. Most of the proposed solutions
> are using nearly the same operators/functions for comparison, or at
> least the same comparison is being done. Instead, the problem is a
> Foolish Line Painter problem: with single-byte characters, finding all
> occurrences of one string in another by repeatedly using offset() with
> charsToSkip scales well; but with multi-byte characters, the penalty
> for repeatedly calculating out longer and longer skips is exponential.
The actual reason it's not linear when you have Unicode strings (the
behavior isn't exponential, btw - cubic, I think) is that working out a
character index in a Unicode string is an O(n) operation - in native
strings it's O(1).
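
To make that concrete, the kind of loop being discussed looks roughly
like this (a sketch only, with my own variable names - not code from
your posts):

    -- collect every match by repeatedly calling offset() with a
    -- growing charsToSkip; with Unicode text each skip has to be
    -- re-walked character by character, which is where the time goes
    function allOffsets pNeedle, pHaystack
       local tSkip, tPos, tList
       put 0 into tSkip
       repeat forever
          -- offset() returns a position relative to the skipped
          -- portion, so the absolute position is tSkip + tPos
          put offset(pNeedle, pHaystack, tSkip) into tPos
          if tPos = 0 then exit repeat
          put tSkip + tPos & comma after tList
          put tSkip + tPos into tSkip
       end repeat
       if the last char of tList is comma then delete the last char of tList
       return tList
    end allOffsets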
So, to revise what I said before: you actually need to use
codeunitOffset() with formSensitive and caseSensitive set to true, with
your input strings appropriately processed.
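
Something along these lines should do it (again a sketch, untested -
here I'm taking normalizeText() to NFC plus toLower() as the
"appropriate processing"; adjust to whatever normal form and casing
you actually need):

    function allCodeunitOffsets pNeedle, pHaystack
       local tSkip, tPos, tList
       -- pre-process once: same normal form and same case for both
       -- strings, so an exact codeunit comparison agrees with the
       -- usual text comparison
       put toLower(normalizeText(pNeedle, "NFC")) into pNeedle
       put toLower(normalizeText(pHaystack, "NFC")) into pHaystack
       -- exact comparisons only - this is what avoids the O(n)
       -- character-index mapping on every call
       set the caseSensitive to true
       set the formSensitive to true
       put 0 into tSkip
       repeat forever
          put codeunitOffset(pNeedle, pHaystack, tSkip) into tPos
          if tPos = 0 then exit repeat
          put tSkip + tPos & comma after tList
          put tSkip + tPos into tSkip
       end repeat
       if the last char of tList is comma then delete the last char of tList
       return tList
    end allCodeunitOffsets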
The only wrinkle with that is that you might get 'false-positive'
matches - i.e. where a match does not fall on character boundaries:
e.g. [e, combining-acute] will be found in [e, combining-acute,
combining-grave] even though the latter does not contain the former as
characters (though it does as codeunits).
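
If you need to filter those out, one possible post-check is to
re-compare the candidate region as characters - something like this
(sketch only, assuming the strings have been pre-processed as above):

    function isCharacterMatch pNeedle, pHaystack, pUnitPos
       local tCharCount, tHead, tTail, tHeadChars, tTailChars
       -- these properties are local to each handler, so set them here too
       set the caseSensitive to true
       set the formSensitive to true
       -- compare whole characters from the match onwards: if a combining
       -- mark follows the matched codeunits, the first tCharCount
       -- characters of the tail will differ from the needle
       put the number of characters in pNeedle into tCharCount
       put codeunit pUnitPos to -1 of pHaystack into tTail
       if char 1 to tCharCount of tTail is not pNeedle then
          return false
       end if
       -- also reject matches starting mid-character (e.g. a needle that
       -- begins with a combining mark): splitting the haystack at the
       -- match must not change its character count.  This count is not
       -- constant-time, so use a cheaper test if you have many matches.
       if pUnitPos = 1 then
          put empty into tHead
       else
          put codeunit 1 to pUnitPos - 1 of pHaystack into tHead
       end if
       put the number of characters in tHead into tHeadChars
       put the number of characters in tTail into tTailChars
       return tHeadChars + tTailChars = the number of characters in pHaystack
    end isCharacterMatch

Calling that on each hit from the codeunit loop keeps the scan itself
fast and only pays the character-counting cost per candidate match.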
Using textEncode("UTF-32") / binary offset will cause more false
positives than that (as you've mentioned before), as it could find
matches which are not on a 4-byte boundary.
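
For completeness, that variant would look something like this (my
names again; the same pre-processing caveats apply, and the mod-4 test
is what rejects those non-boundary matches - the combining-mark kind
can still slip through):

    function allUTF32Offsets pNeedle, pHaystack
       local tNeedleBytes, tHaystackBytes, tSkip, tPos, tList
       put textEncode(pNeedle, "UTF-32") into tNeedleBytes
       put textEncode(pHaystack, "UTF-32") into tHaystackBytes
       put 0 into tSkip
       repeat forever
          put byteOffset(tNeedleBytes, tHaystackBytes, tSkip) into tPos
          if tPos = 0 then exit repeat
          add tPos to tSkip
          -- only matches starting on a 4-byte boundary are real
          -- codepoint matches; record the 1-based codepoint index
          if (tSkip - 1) mod 4 = 0 then
             put ((tSkip - 1) div 4) + 1 & comma after tList
          end if
       end repeat
       if the last char of tList is comma then delete the last char of tList
       return tList
    end allUTF32Offsets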
Warmest Regards,
Mark.
--
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps