Slow stack problem

Fri Jun 28 08:02:13 EDT 2024

On 2024-06-28 11:15, Neville Smythe via use-livecode wrote:
> In my last epistle I mentioned the repeat loop had only 32 iterations. 
> Much more relevant of course is the inner loop on the number of lines 
> of the data variable fff. In this case fff had 1760 lines. So the total 
> possible number of iterations was around 30000 to 50000, getting up 
> there but still well within LC capability.
> 
> I tried operating the loop on just the first 100 lines of fff. The 
> repeats took 0.006 seconds (impressive).
> 
> Then just the first 300 lines. Timing was 0.016 seconds, approximately 
> linear increase as expected

Are you sure a linear increase is expected?

> Then the first 500 lines. Timing is now 6.43 seconds. Something very 
> odd there, that’s beyond exponential increase.
> 
> And on the last 500 lines, timing was 0.135 seconds (Aha !!!)

I take it you tested this by doing the loop where fff was just line 1 to 
300, then just line 1 to 500 and then just -500 to -1?

My guess here is that the first 300 lines do not have a unicode 
character, there is somewhere in the next 200, and there are none in the 
last 500.

> This would seem to point to matchChunk having indigestion over 
> something in the middle of the text data. The data is 98% plain ascii, 
> but quite likely it has a few Unicode characters: I have taken a whole 
> lot of time to convert my legacy app to accept Asian and European 
> Unicode personal names.

All regex matches in LC are Unicode (under the hood) - so thinking this 
is regex related is a red-herring. Running a regex on Unicode text, is 
no different from on native text - its just its dealing with 16-bit 
units rather than 8-bit (regex is generally a linear operation - it 
takes time proportional, roughly to length of pattern * length of string 
- well, as long as you don't use any backtracking type features).

The issue here is the assumption that your code is doing something 
linear...

It *looks* linear because your code is only doing two nested repeat 
loops - so from the point of view of lines of script its iterating the 
central loop roughly 'the number of lines of indexList * the number of 
lines of fff' - however the engine is doing a lot more work.

The central loop is doing (paraphrased);

    repeat with x = 1 to N
      get line x of jjj
    end repeat

This means the engine is searching for a line delimiter not N times, but 
sum(1, ..., N) times (which is N*(N-1)/2 - i.e. N^2 roughly).

Processing 300 lines will take the engine about 45000 steps, processing 
500 lines will take the engine 125000 steps, processing 1760 lines will 
take the engine > 1,500,000.

This means that a small change in how long it takes to search for a line 
delimiter (which is the fundamental operation here) makes a big 
difference to how long doing 1000's of them will take - and searching a 
string containing unicode is a fair bit slower than searching a string 
which contains only native characters.

Fortunately there is an easy solution here - use an array so you get 
random access to the lines of fff:

    split fff by return
    repeat with i=2 to the number of lines of indexList
       put line (i+1) of indexList into which

       put "(^[0-9-]*\t" & which & ")" into regX

       put false into found
       repeat with k=lastFound+1 to the number of elements of fff
          put matchChunk(fff[k],regX,pos1,pos2) into found

          if found then exit repeat
       end repeat

       if found then
          put k into lastFound
       end if

       put lastFound into item i of mylineNumbers
    end repeat

This should do exactly the same thing but SUBSTANTIALLY faster whether 
your fff variable contains native or unicode characters.

Warmest Regards,

Mark.

-- 
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Build Amazing Things