OMG text processing performance 6.7 - 9.5
neville.smythe at optusnet.com.au
Tue Feb 4 19:12:21 EST 2020
The recent testing of the Parse1 and Parse2 algorithms must, I think, have been on ascii rather than utf-8 text.
I tested on the English translation of Les Miserables, to ensure at least a sprinkling of multi-byte characters in the text, and a longish file: 3.4 MB. I searched for the string ‘Valjean’, which obviously occurs very frequently.
The searches were first applied to the raw binary text as read from the utf-8 encoded file, without decoding; then to the same text after utf-8 decoding.
Parse 0: using itemdelimiter ‘Valjean’ (case insensitive)
Parse 1: using offset with skips
Parse 2: using offset, truncating the text and using a skip of 0
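For readers who have not seen the handlers, the three strategies can be sketched roughly as follows. This is Python, not LiveCode, and the function names are made up for illustration; the actual Parse0/Parse1/Parse2 handlers are not shown in this post, so the shapes below are assumptions. The sketch is case sensitive for simplicity, and Python's timings will not reproduce the LiveCode numbers; it only shows what each approach does.

```python
def parse0_split(text, needle):
    """Parse 0 analogue: treat the needle as a delimiter and walk the pieces."""
    positions = []
    pos = 0
    pieces = text.split(needle)
    # Every piece except the last is followed by one occurrence of the needle.
    for piece in pieces[:-1]:
        pos += len(piece)
        positions.append(pos)
        pos += len(needle)
    return positions


def parse1_skip(text, needle):
    """Parse 1 analogue: keep searching the same string, skipping past earlier matches."""
    positions = []
    skip = 0
    while True:
        found = text.find(needle, skip)
        if found == -1:
            return positions
        positions.append(found)
        skip = found + len(needle)


def parse2_truncate(text, needle):
    """Parse 2 analogue: cut off the part already searched, always search from 0."""
    positions = []
    consumed = 0  # characters removed from the front so far
    rest = text
    while True:
        found = rest.find(needle)
        if found == -1:
            return positions
        positions.append(consumed + found)
        cut = found + len(needle)
        consumed += cut
        rest = rest[cut:]
```

All three return the same list of zero-based match positions; they differ only in how they move through the text.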
searches on raw text
parse0 10 ms
parse1 9 ms
parse2 708 ms
searches on utf-8 text
parse0 4402 ms
parse1 225469 ms
parse2 3453 ms
The winner for long utf-8 text is Parse 2; for raw text, Parse 1 and Parse 0 are equivalent. The results dramatically demonstrate the exponential decay in performance with long utf-8 text.
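For most searches I would think one could use the raw text as long as one was searching for an ascii string. False positives, where the string of single bytes occurs inside multibyte characters, should not occur at all in valid utf-8, since every byte of a multibyte character has its high bit set and so cannot match an ascii byte. Bear in mind, though, that offsets found in the raw text are byte offsets, not character offsets.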
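A small sketch of the raw-text approach, again in Python rather than LiveCode and with hypothetical helper names: it searches the undecoded bytes for an ascii needle and, only when a character position is needed, decodes the prefix before the match to convert the byte offset.

```python
def find_in_raw(raw, needle):
    """Search undecoded utf-8 bytes for an ascii needle; returns a byte offset or -1."""
    return raw.find(needle.encode("ascii"))


def byte_to_char_offset(raw, byte_offset):
    """Convert a byte offset into a character offset by decoding only the prefix."""
    return len(raw[:byte_offset].decode("utf-8"))


def multibyte_bytes_are_non_ascii(text):
    """Check that every byte of every multibyte character is 0x80 or above."""
    return all(
        b >= 0x80
        for ch in text
        if ord(ch) > 0x7F
        for b in ch.encode("utf-8")
    )


text = "Les Mis\u00e9rables: Jean Valjean"  # contains one two-byte character
raw = text.encode("utf-8")

byte_offset = find_in_raw(raw, "Valjean")
char_offset = byte_to_char_offset(raw, byte_offset)
```

Here the byte offset is one larger than the character offset, because the accented letter before the match takes two bytes.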