Slow stack problem

Sat Jun 29 03:53:58 EDT 2024

Thanks Mark for pointing me (as ever) in the right direction: away from the red herring and towards the hidden inner loop entailed in evaluating

line k of fff

A bit embarrassing because I was party to the discussion some time ago about the slowness of lineOffset when working with unicode text.

And thanks for the neat array trick which I wouldn’t have thought of.
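
A minimal sketch of the array approach, assuming it means splitting the text once into an array keyed by line number and then indexing the array instead of asking for line k. The handler name walkLinesViaArray and the variable names tLines, tCount and x are made up for illustration; this is not the code from the original exchange.

```livecode
command walkLinesViaArray fff
   local tLines, tCount, k, x
   -- work on a copy so the caller's text is left untouched
   put fff into tLines
   -- one pass over the text: tLines becomes an array keyed 1, 2, 3, ...
   split tLines by return
   put the number of elements of tLines into tCount
   repeat with k = 1 to tCount
      -- array lookup by key, no rescan of the text from its start
      put tLines[k] into x
   end repeat
end walkLinesViaArray
```

The point of the design is that the search for line endings happens once, in the split, rather than once per iteration.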

There is indeed just one line in the text which contains a single Unicode character. Without that character, the variable fff would evidently be held internally as an ascii (or native?) string; with it, the whole string is utf16 and we hit the lineOffset problem.

But I am still bemused. 

Is it not the case that the processing time for looping over the number of lines and getting the k-th line in each iteration, for some arbitrary k, is going to be

Order(N^2) * C * T

where

N = the number of lines in the text

C = the number of codepoints in each line 

T =  the (average) time for processing each codepoint to check for a return character

Now N and C are the same whether the text is ascii or unicode, so any difference in the timings has to come from T.

Test 1
If I get rid of the red herring by replacing the matchChunk call with a simple

put line k of fff into x; put true into found

(as if matchChunk always finds a match on the first line tested), then I am timing just the finding of line endings:

The time for processing plain ascii is   0.008 seconds
The time for processing unicode is       0.84 seconds

Which would appear to mean that processing unicode codepoints for the apparently simple task of checking for return takes 100 times as much time as processing ascii. That seems excessive, to put it mildly!
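
For anyone wanting to reproduce this, a Test 1 style timing loop might look like the sketch below. It is a reconstruction under assumptions, not the original script: the function name timeLineAccess and the variables tCount and tStart are hypothetical, and the loop simply fetches every line in turn.

```livecode
function timeLineAccess fff
   local tCount, tStart, k, x, found
   -- count the lines before starting the clock
   put the number of lines of fff into tCount
   put the milliseconds into tStart
   repeat with k = 1 to tCount
      -- each evaluation of "line k" has to locate line k within fff
      put line k of fff into x
      put true into found
   end repeat
   -- elapsed time in seconds
   return (the milliseconds - tStart) / 1000
end timeLineAccess
```

Calling it once with the plain ascii text and once with the text containing the Unicode character should give the two figures to compare.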

Test 2
The original code uses matchChunk, which of course has its own internal loop over codepoints, so multiply by another 8 (it only searches the first few characters). It will also not always return true, so more lines must be processed. But again, these are the same multipliers whether ascii or unicode.

Plain ascii takes   0.07 seconds
Unicode takes      19.9 seconds, a multiplier of nearly 300. I can easily believe matchChunk takes 3 times as long to process unicode as ascii; this is the sort of factor I would have expected in Test 1.

OK Mark, hit me again (I am a glutton for punishment): what is wrong with this analysis?

Neville Smythe