OMG text processing performance 6.7 - 9.5
Neville
neville.smythe at optusnet.com.au
Wed Feb 5 18:41:17 EST 2020
Richard, here is a link to my test stack:
https://www.dropbox.com/sh/bbpe12p8bf56ofe/AADbhV2LavLP4Y3CZ8Ab8NGia?dl=0
The LesMiserables.txt file is included for convenience; it should be placed in your Documents directory. The algorithms are all in the script for the `Run` button.
I am still mystified that the character offset searches give the same number for each hit in the UTF-8 text as in the raw text. Surely `char x of theUTF8Text` returns the Unicode character at offset x, while `char x of theRawText` returns the 8-bit ASCII character (that is, the single byte) at offset x of the raw text? How then can x be the same for corresponding hits, when I know there are some multibyte Unicode characters in the text (e.g. the e-acute in Miserables)? Indeed, just what does `textDecode(theRawText, "UTF-8")` do: does it modify the actual text at all, or does it just set a property flag?
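To make concrete what I expected to see, here is a minimal sketch in Python rather than LiveCode (the sample string and the `offsets` helper are made up for illustration, and Python offsets are 0-based where LiveCode's are 1-based). It only shows how byte offsets and character offsets should drift apart once a multibyte character has been passed; it says nothing about what LiveCode does internally.

```python
# Hypothetical sample line; the e-acute takes 2 bytes in UTF-8.
SAMPLE = "Les Mis\u00e9rables, by Victor Hugo. Jean Valjean"


def offsets(text, needle):
    """Return (character offset, byte offset) of the first hit, 0-based."""
    raw = text.encode("utf-8")  # the "raw text": one item per byte
    return text.find(needle), raw.find(needle.encode("utf-8"))


# A hit before the e-acute: both offsets agree.
print(offsets(SAMPLE, "Les"))

# A hit after the e-acute: the byte offset is one larger,
# because the e-acute occupies two bytes but is one character.
print(offsets(SAMPLE, "Valjean"))
```

If the two searches really did work on bytes and characters respectively, every hit after the first accented character should differ like this.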
Another mystery: I decided to extend the search algorithms by adding in matchChunk. In this case I use the regular expression `(?m)(?i)(Valjean)` to get the start and end offsets of the first match, and then truncate the initial section as per Parse2. As expected, this search is much slower than any of the others on the raw text, since it has a lot more to do. I then expected to get around the same time for the search on UTF-8 text, rather than an exponentially worse time, since matchChunk is presumably encoding-blind. But it is actually 15% faster than on the raw text. In fact it is the fastest of all the algorithms for finding offsets if you must* search UTF-8 text! How can this be? I don’t believe the UTF-8 text is 15% smaller than the raw text!
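For anyone who does not want to open the stack, the shape of the matchChunk loop is roughly the following. This is a sketch in Python, not the LiveCode script itself, and it reflects only my description above: find the first match, record its offset, chop off everything up to the end of the match, repeat. The function name `find_all_offsets` is invented for the example, and the flags are passed as arguments instead of the inline `(?m)(?i)` form.

```python
import re

# Same pattern as in the post, with multiline and case-insensitive flags.
PATTERN = re.compile(r"(Valjean)", re.MULTILINE | re.IGNORECASE)


def find_all_offsets(text):
    """Collect 0-based offsets of every match by repeated truncation."""
    found = []
    consumed = 0          # characters already chopped off the front
    remaining = text
    while True:
        match = PATTERN.search(remaining)
        if match is None:
            break
        # Offset within the original text, not the truncated remainder.
        found.append(consumed + match.start(1))
        # Truncate the initial section up to the end of this match.
        consumed += match.end(1)
        remaining = remaining[match.end(1):]
    return found


print(find_all_offsets("Jean Valjean met VALJEAN; valjean."))
```

The truncation step copies the remainder of the text on every hit, which is part of why I expected this approach to be the slow one.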
Search times in ms, on raw text and on UTF-8 text:

Algorithm     Raw text (ms)   UTF-8 text (ms)
matchChunk             3059              2492
filter                   16              1954
parse0                   10              3788
parse1                    8            223254
parse3                 2244            634423
parse2                  671              3409
parse4                  682              7166
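To put the two sets of timings side by side, the ratio of UTF-8 time to raw time can be worked out for each algorithm. A small Python sketch follows; the numbers are exactly the ones reported above, and the `slowdown` helper is just a name chosen for the example.

```python
# (raw ms, UTF-8 ms) for each algorithm, copied from the timings above.
TIMINGS_MS = {
    "matchChunk": (3059, 2492),
    "filter": (16, 1954),
    "parse0": (10, 3788),
    "parse1": (8, 223254),
    "parse3": (2244, 634423),
    "parse2": (671, 3409),
    "parse4": (682, 7166),
}


def slowdown(raw_ms, utf8_ms):
    """How many times longer the UTF-8 search took than the raw search."""
    return utf8_ms / raw_ms


for name, (raw_ms, utf8_ms) in TIMINGS_MS.items():
    print(f"{name:<10} {slowdown(raw_ms, utf8_ms):>10.1f}x")
```

matchChunk is the only entry whose ratio comes out below 1; every other algorithm is slower on the UTF-8 text, parse1 by a factor in the tens of thousands.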
*As I mentioned, in most cases character offset searching of the raw text should be OK if you are searching for 7-bit ASCII strings of length, say, greater than 2. But I think the lineOffset and filter operations could give false positives if a multibyte character in the text contains 0D as a component byte.
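Whether that last worry can actually bite depends on the encoding of the raw bytes. As a quick check, sketched in Python rather than LiveCode (the function name is invented for the example), one can scan every Unicode code point and ask whether its multibyte encoding ever contains the byte 0x0D. For UTF-8 the answer is never, because every byte of a multibyte sequence has its high bit set; for UTF-16 it does happen.

```python
def multibyte_chars_containing(byte_value, encoding, limit=0x110000):
    """List non-ASCII characters whose encoded form contains byte_value."""
    hits = []
    for code_point in range(0x80, limit):
        if 0xD800 <= code_point <= 0xDFFF:
            continue  # surrogates are not encodable characters
        encoded = chr(code_point).encode(encoding)
        if len(encoded) > 1 and byte_value in encoded:
            hits.append(chr(code_point))
    return hits


# UTF-8: no multibyte character contains a 0x0D byte.
print(multibyte_chars_containing(0x0D, "utf-8"))  # prints []

# UTF-16 (little-endian, no BOM): some characters do, e.g. U+010D.
utf16_hits = multibyte_chars_containing(0x0D, "utf-16-le", limit=0x10000)
print("\u010d" in utf16_hits)  # prints True
```

So for raw UTF-8 bytes a search for a 7-bit delimiter cannot land inside a multibyte character; the false positives would only be a concern if the raw text were in a different encoding such as UTF-16.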
Neville