OMG text processing performance 6.7 - 9.5

Richard Gaskin ambassador at fourthworld.com
Wed Feb 5 00:48:31 EST 2020


Super thorough work there, Neville.  Thanks.

Could I trouble you to post code listings for the various algos?

I'd like to try them on my MBOX archives, and they may also be useful 
for others looking for parsing routines in the archives.

-- 
  Richard Gaskin
  Fourth World Systems


Neville wrote:
> Just for interest, and to see just how slow lineOffset is, I added a couple more tests to the search for occurrences of “Valjean” in the Gutenberg English translation of Les Miserables. I also wanted to find out how filter performs.
> 
> The searches were first applied to the raw binary text as read from the utf-8 encoded file, without using textDecode, and then to the utf-8 decoded text.
> 
> Parse 0: using itemDelimiter 'Valjean' (case insensitive)
> Parse 1: using offset with skips
> Parse 2: using offset, truncating the text and 0 skip
> Parse 3: using lineOffset with skips
> Parse 4: using lineOffset, truncating the text and 0 skip
> filter: using filter to find lines containing '*Valjean*'
> 
> Parse 1 and 2 produced 1120 hits. Parse 0 gave 1121 hits, the extra one being a false positive at the end of the file, which needs to be accounted for in an implementation. I was slightly surprised that the character offsets produced were the same for the raw and the utf-8 text; I guess I was expecting the latter to give the unicode character offset. Parse 3, Parse 4 and filter all output 1099 lines.
> 
> Results:
> 
> searches on raw text:
> 
> parse0      11 ms
> parse1       9 ms
> parse2     751 ms
> parse3    2551 ms
> parse4     753 ms
> filter      16 ms
> 
> searches on utf-8 text:
> 
> parse0      4386 ms
> parse1    224367 ms
> parse2      3461 ms
> parse3    636554 ms  !!!!
> parse4      7242 ms
> filter      2258 ms
> 
> So for long texts it is best to use the raw binary text and search with the character offset function, offset(pSt,pSrc,skip) [Parse 1]. If you have to search utf-8 decoded text, use Parse 2, deleting the initial section of the text as you go. Never use lineOffset (except on small texts), even if that means extra code to find the line endings on either side of the character offset when what you really want is the found line. If you don’t actually need the offsets of the hits in the original file (for example, for editing the original), then filter is the fastest on long text and just as fast on short text. Depending on your needs you will probably have to do another search on the filtered text, but that is a viable approach if the number of lines produced is itself small.
> 
> Neville
> 
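
For anyone following along, here is a minimal sketch of what the two character-offset approaches (Parse 1 and Parse 2) might look like in LiveCode script. This is not Neville's code: the handler and variable names are hypothetical, and the only engine feature assumed is the built-in offset(needle, haystack, skip) function, whose result is counted from the end of the skipped characters.

```livecode
-- Hypothetical sketch of "Parse 1": offset with skips.
-- Returns a return-delimited list of the character offsets of pNeedle in pText.
function offsetsWithSkip pNeedle, pText
   local tSkip, tPos, tHits
   put 0 into tSkip
   repeat forever
      -- offset() returns the position relative to the skipped characters
      put offset(pNeedle, pText, tSkip) into tPos
      if tPos is 0 then exit repeat
      add tPos to tSkip  -- tSkip is now the absolute offset of this hit
      put tSkip & return after tHits
   end repeat
   delete the last char of tHits  -- drop the trailing return
   return tHits
end offsetsWithSkip

-- Hypothetical sketch of "Parse 2": offset with 0 skip, truncating the text.
-- pText is a by-value parameter, so deleting from it leaves the caller's copy intact.
function offsetsByTruncation pNeedle, pText
   local tBase, tPos, tHits
   put 0 into tBase
   repeat forever
      put offset(pNeedle, pText) into tPos
      if tPos is 0 then exit repeat
      put tBase + tPos & return after tHits
      add tPos to tBase  -- remember how many characters have been removed
      delete char 1 to tPos of pText
   end repeat
   delete the last char of tHits
   return tHits
end offsetsByTruncation
```

Both handlers resume the search one character past the start of each hit, so they should return the same list for the same input. To try the raw-text case, the file can be read with something like put url ("binfile:" & tPath) into tRaw (tPath being a hypothetical file path) and passed in without textDecode.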