How to find offsets in Unicode Text fast
Geoff Canyon
gcanyon at gmail.com
Tue Nov 13 05:37:27 EST 2018
A lot of useful points here, thanks.
The caseSensitive vs. binary has been covered in the other discussion --
Monte said that by using offset() "the engine will convert your data to a
string and assume native encoding. This is probably why you are getting
some case insensitivity."
I understand (generally) the complexity of comparison, but that's not the
speed issue causing this discussion. Most of the proposed solutions are
using nearly the same operators/functions for comparison, or at least the
same comparison is being done. Instead, the problem is a Foolish Line
Painter problem: with single-byte characters, finding all occurrences of
one string in another by repeatedly using offset() with charsToSkip scales
well; but with multi-byte characters, the penalty for repeatedly
calculating out longer and longer skips is exponential.
The fastest proposed solutions so far, which both scale much more closely
to linearly with the size of the stringToSearch, are:
1. Set the item delimiter to the charsToFind and then use repeat for each
item on the stringToSearch. This comes with its own headaches, but supports
(lack of) case sensitivity.
2. Converting to UTF-32 binary, and using offset and charsToSkip. This
hasn't yet been corrected to use byteOffset and toUpper to property handle
case insensitivity, but once that is done, this will probably be the
preferred solution between the two -- the logic is simpler, and it's
several times faster.
Thanks,
Geoff
On Tue, Nov 13, 2018 at 12:27 AM Mark Waddingham via use-livecode <
use-livecode at lists.runrev.com> wrote:
> On 2018-11-13 01:06, Geoff Canyon via use-livecode wrote:
> > On Mon, Nov 12, 2018 at 11:36 AM Ben Rubinstein via use-livecode <
> > use-livecode at lists.runrev.com> wrote:
> >
> >> I'm really confused that case-insensitive should work at all for
> >> UTF-16 or
> >> UTF-32;
>
> The caseSensitive (and formSensitive) properties only apply to strings
> *not* binary strings.
>
> The output of textEncode() is a binary string.
>
> The 'is' operator is overloaded - in strict order:
>
> left-empty 'is' right-ANY -- returns is-empty(right-ANY)
> left-ANY 'is' right-empty -- returns is-empty(left-ANY)
> left-array 'is' left-array -- compare as array
> left-number 'is' right-number -- compare as number
> left-numeric-[binary]-string 'is' right-numeric-[binary]-string --
> compare as number
> left-binary-string 'is' right-binary-string -- compare as binary
> strings
> left-any 'is' right-any -- compare as strings
>
> Also concatenation, put after and put before are overloaded:
>
> binary-string & binary-string -> binary-string
> string & ANY -> string
> ANY & string -> string
>
> put src-data after|before dst-data -> dst-data is binary-string
> put src-ANY after|before dst-ANY -> dst-ANY is string
>
> > This is so puzzling. I tried this code in a button:
> >
> > on mouseUp
> > put "Ѡ" into x
> > put "ѡ" into y
> > --put ("Ѡ" is "ѡ") && (x is y)
> > --exit mouseUp
> > put textencode("Ѡ","UTF-32") into xBig
> > put textencode("ѡ","UTF-32") into xSmall
> > repeat for each byte B in xBig
> > put B after yBig
> > end repeat
> > repeat for each byte B in xSmall
> > put B after ySmall
> > end repeat
> > put "Ѡ" into zBig
> > put "ѡ" into zSmall
> > put zBig into wBig
> > put zSamll into wSmall
> > put textencode(zBig,"UTF-32") into zBig
> > put textencode(zSmall,"UTF-32") into zSmall
> > put x into j
> > put y into k
> > set caseSensitive to false
> > put ("Ѡ" is "ѡ") && (xBig is xSmall) && (yBig is ySmall) && (zBig is
> > zSmall) && (wBig is wSmall) && (x is y) && (j is k)
> > end mouseUp
> >
> >
> > That puts: true false false false true true true
> >
> > Things to note:
> >
> > 1. "Ѡ" and "ѡ" are upper and lower case omega in cyrillic, 00000460 and
> > 00000461. Given the string literals, LC is happy to say they are the
> > same
> > (the first true)
> > 2. Put them in a variable, LC is happy to say they are the same
> > (the second-to-last true).
> > 3. Convert them to UTF-32 and LC no longer recognizes them as the same
> > (the
> > fourth boolean, false)
> > 4. Put the variables into other variables, and LC identifies them as
> > the
> > same (the last true)
>
> ("Ѡ" is "ѡ") is true because they are both strings
> (xBig is xSmall) is false because both sides are binary-strings (and so
> compare byte for byte)
> (yBig is ySmall) is false because both sides are binary-strings
> (zBig is zSmall) is false because you've textEncoded strings which
> produce binary-strings so both are binary strings
> (wBig is wSmall) is true because both sides are strings
> (x is y) is true because both sides are strings
> (j is k) is true because both sides are strings
>
> One could argue that 'is'/'is not' should never have been overloaded to
> do binary string comparison - and that should have perhaps been added as
> a separate operator (especially since binary strings are compared as
> numbers if numeric). With hindsight I'd probably agree as it is a slight
> discontinuity in terms of comparison with pre-7.
>
> Indeed, had we not added that overload then we would not be having this
> discussion - it would have been a similar discussion as used to come up
> a lot with comparing the output of compress() and other functions which
> have always produced binary data - and why comparisons seemed 'not as
> one would expect'.
>
> Warmest Regards,
>
> Mark.
>
> --
> Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
> LiveCode: Everyone can create apps
>
> _______________________________________________
> use-livecode mailing list
> use-livecode at lists.runrev.com
> Please visit this url to subscribe, unsubscribe and manage your
> subscription preferences:
> http://lists.runrev.com/mailman/listinfo/use-livecode
More information about the use-livecode
mailing list