OMG text processing performance 6.7 - 9.5

Neville neville.smythe at optusnet.com.au
Wed Feb 5 18:41:17 EST 2020


Richard, here is a link to my test stack

https://www.dropbox.com/sh/bbpe12p8bf56ofe/AADbhV2LavLP4Y3CZ8Ab8NGia?dl=0

The LesMiserables.txt file is included for convenience; it should be placed in your Documents directory. The algorithms are all in the script for the `Run` button.

I am still mystified that the character offset searches give the same number for each hit in the UTF-8 text as in the raw text. Surely `char x of theUTF8Text` returns the Unicode character at offset x, while `char x of theRawText` returns the 8-bit ASCII character at offset x of the raw text? How then can x be the same for corresponding hits, when I know there are some multibyte Unicode characters in the text (e.g. the e-acute in Misérables)? Indeed, just what does `textDecode(theRawText,"UTF-8")` do: does it modify the actual text at all, or just set a property flag?
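To make the expectation behind this question concrete, here is a minimal sketch in Python (not LiveCode, so it says nothing about what `textDecode` does internally). It assumes a short sample string of my own rather than the real file, and shows that when bytes and decoded characters are genuinely distinct, a hit that follows a multibyte character lands at different offsets in the two forms.

```python
# Illustration only: Python bytes vs. str, not LiveCode internals.
# The sample string is made up for this sketch.
raw = "Les Misérables: Jean Valjean".encode("utf-8")  # bytes, as read from disk
text = raw.decode("utf-8")                           # sequence of Unicode code points

byte_offset = raw.find(b"Valjean")   # offset counted in bytes
char_offset = text.find("Valjean")   # offset counted in characters

print(len(raw), len(text))           # the byte length is one larger: é takes two bytes
print(byte_offset, char_offset)      # the byte offset is one larger for the same reason
```

If the offsets reported for both texts are identical on a file known to contain multibyte characters, one possibility worth checking is whether the two variables really hold different representations at the point where the search runs.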

Another mystery: I decided to extend the search algorithms by adding matchChunk. In this case I use the regular expression `(?m)(?i)(Valjean)` to get the start and end offsets of the first match, and then truncate the initial section as per Parse2. As expected, this search is much slower than any of the others on the raw text, since it has a lot more to do. I then expected to get around the same time for the search on UTF-8 text, rather than a far worse time, since matchChunk is presumably encoding-blind. But it is actually 15% faster than on the raw text; in fact it is the fastest of all the algorithms for finding offsets if you must* search UTF-8 text! How can this be? I don't believe the UTF-8 text is 15% smaller than the raw text!
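The match-then-truncate loop described above can be sketched as follows. This is Python rather than LiveCode, and it is an assumption about the shape of the algorithm, not the script from the `Run` button: the helper name `match_offsets` and the sample string are hypothetical, and the `(?m)(?i)` inline flags are passed as `re.M | re.I` instead.

```python
import re

def match_offsets(text, pattern):
    """Return the 1-based start offset of every match, truncating after each hit."""
    offsets = []
    consumed = 0                     # how much has already been cut from the front
    rest = text
    while True:
        m = pattern.search(rest)
        if m is None:
            return offsets
        offsets.append(consumed + m.start() + 1)
        consumed += m.end()
        rest = rest[m.end():]        # drop everything up to the end of this hit

# Made-up sample containing one multibyte character before the last hit.
raw = "Jean Valjean. VALJEAN! Misérables valjean".encode("utf-8")
text = raw.decode("utf-8")

byte_hits = match_offsets(raw, re.compile(rb"(Valjean)", re.M | re.I))
char_hits = match_offsets(text, re.compile(r"(Valjean)", re.M | re.I))
print(byte_hits, char_hits)
```

Each truncation copies the remainder of the text, so the cost of this loop depends heavily on how cheaply the language can slice a string in a given representation; that may be one place to look when the raw and UTF-8 timings differ.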

searches on raw text
matchChunk          3059 ms
filter                16 ms
parse0                10 ms
parse1                 8 ms
parse3              2244 ms
parse2               671 ms
parse4               682 ms

searches on utf-8 text
matchChunk utf8     2492 ms
filter utf8         1954 ms
parse0 utf8         3788 ms
parse1 utf8       223254 ms
parse3 utf8       634423 ms
parse2 utf8         3409 ms
parse4 utf8         7166 ms
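
For comparing the two sets of timings, a small Python sketch like the one below computes the UTF-8 to raw ratio for each algorithm. The numbers are copied from the results above; the dictionary names are just for this illustration.

```python
# Timings in milliseconds, copied from the results above.
raw_ms = {"matchChunk": 3059, "filter": 16, "parse0": 10, "parse1": 8,
          "parse3": 2244, "parse2": 671, "parse4": 682}
utf8_ms = {"matchChunk": 2492, "filter": 1954, "parse0": 3788, "parse1": 223254,
           "parse3": 634423, "parse2": 3409, "parse4": 7166}

# Ratio > 1 means the UTF-8 search was slower than the raw search.
ratios = {name: utf8_ms[name] / raw_ms[name] for name in raw_ms}

for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:<12}{ratio:>10.2f}x")
```

On these figures matchChunk is the only algorithm with a ratio below 1, while the slowdown for the others ranges from about 5x (parse2) to several thousand times (parse1).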

*As I mentioned, in most cases character offset searching of the raw text should be OK if you are searching for 7-bit ASCII strings of length, say, >2. But I think the lineOffset and filter operations could give false positives if a multibyte character in the text contains 0D as a component byte.
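
One way to test this worry, reading the byte in question as hex 0D (carriage return), is to look at which byte values can occur inside multibyte sequences. The Python sketch below assumes the raw text really is UTF-8; the helper name `multibyte_bytes` and the sample characters are hypothetical. In UTF-8 every byte of a multibyte sequence has its high bit set, so it cannot equal 0D or 0A; the same is not true of UTF-16.

```python
def multibyte_bytes(text):
    """Collect every byte value that belongs to a multibyte UTF-8 sequence."""
    seen = set()
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1:
            seen.update(encoded)
    return seen

# Sample with 2-, 3- and 4-byte characters: é, right single quote, euro sign, an emoji.
sample = "Mis\u00e9rables \u2019 \u20ac \U0001F600"
component_bytes = multibyte_bytes(sample)
print(sorted(hex(b) for b in component_bytes))   # every value is 0x80 or above

# Contrast: in UTF-16, U+010D (c with caron) is stored as the bytes 0D 01.
utf16_bytes = "\u010d".encode("utf-16-le")
print(utf16_bytes.hex())
```

So for a file that is valid UTF-8, a byte-level search for a line delimiter should not hit the middle of a multibyte character, whereas the concern would be real if the raw bytes were UTF-16.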

Neville




