OMG text processing performance 6.7 - 9.5

Neville neville.smythe at optusnet.com.au
Tue Feb 4 19:12:21 EST 2020


I think the recent testing of the Parse1 and Parse2 algorithms must have been on ASCII text, not UTF-8 text.

I tested on the English translation of Les Miserables, to ensure at least a sprinkling of multi-byte characters in the text, and a longish file: 3.4 MB. I searched for the string ‘Valjean’, which obviously occurs very frequently.

The searches were first applied to the raw binary text as read from the UTF-8 encoded file, without decoding; then to the same text after UTF-8 decoding.

Parse 0: using itemdelimiter ‘Valjean’ (case insensitive)

Parse 1: using offset with skips

Parse 2: using offset with 0 skip, truncating the text after each match
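
For readers who want the three strategies spelled out, here is a rough sketch in Python rather than LiveCode. The function names parse0, parse1 and parse2 are only illustrative stand-ins for the approaches listed above, not the scripts actually timed, and the matching here is case sensitive. Python strings are not stored as variable-width UTF-8 internally, so this shows the logic only and will not reproduce the timings.

```python
def parse0(text, needle):
    """Split on the needle (like an item delimiter); matches = pieces - 1."""
    return len(text.split(needle)) - 1


def parse1(text, needle):
    """Search the full text repeatedly, skipping past earlier matches."""
    count, skip = 0, 0
    while True:
        pos = text.find(needle, skip)
        if pos == -1:
            return count
        count += 1
        skip = pos + len(needle)  # next search starts after this match


def parse2(text, needle):
    """Always search from the start, discarding the text already consumed."""
    count = 0
    while True:
        pos = text.find(needle)
        if pos == -1:
            return count
        count += 1
        text = text[pos + len(needle):]  # truncate, so the skip stays 0


sample = "Jean Valjean met Valjean. valjean"
counts = (parse0(sample, "Valjean"),
          parse1(sample, "Valjean"),
          parse2(sample, "Valjean"))
```

All three count non-overlapping matches, so they should agree on any input; the difference is only in how much of the text each search has to walk over.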

Results:

searches on raw text
parse0 10 ms
parse1 9 ms
parse2 708 ms

searches on UTF-8 decoded text
parse0 4402 ms
parse1 225469 ms
parse2 3453 ms


The winner for long UTF-8 text is Parse 2; for raw text Parse 1 and Parse 0 are equivalent. The results dramatically demonstrate how badly performance degrades on long UTF-8 text.

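For most searches I would think one could use the raw text as long as one was searching for an ASCII string. False positives, where the bytes of the search string occur inside a multibyte character, cannot happen in valid UTF-8, because every byte of a multibyte sequence has its high bit set and so can never equal an ASCII byte. The offsets found are of course byte offsets, not character offsets.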
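
A small Python sketch of the raw-text idea, again as an illustration and not LiveCode: it searches the undecoded UTF-8 bytes for an ASCII needle, then converts each byte offset to a character offset by decoding only the prefix. The helper find_all and the sample sentence are made up for the example.

```python
def find_all(haystack, needle):
    """Return the start offsets of all non-overlapping matches."""
    hits, start = [], 0
    while (pos := haystack.find(needle, start)) != -1:
        hits.append(pos)
        start = pos + len(needle)
    return hits


text = "Où est Valjean? Valjean s'était échappé."
raw = text.encode("utf-8")  # the bytes as they would be read from the file

# Search the raw bytes; the needle is pure ASCII.
byte_hits = find_all(raw, b"Valjean")

# An ASCII match always starts on a character boundary, so the prefix
# before it is valid UTF-8 and its decoded length is the char offset.
char_hits = [len(raw[:i].decode("utf-8")) for i in byte_hits]

# Searching the decoded text directly gives the same character offsets.
same_as_decoded = char_hits == find_all(text, "Valjean")
```

Decoding the prefix for every hit is itself linear in the prefix length, so on a long file one would probably only convert the offsets actually needed.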

Neville





