Slow stack problem
Mark Waddingham
mark at livecode.com
Fri Jun 28 08:02:13 EDT 2024
On 2024-06-28 11:15, Neville Smythe via use-livecode wrote:
> In my last epistle I mentioned the repeat loop had only 32 iterations.
> Much more relevant of course is the inner loop on the number of lines
> of the data variable fff. In this case fff had 1760 lines. So the total
> possible number of iterations was around 30000 to 50000, getting up
> there but still well within LC capability.
>
> I tried operating the loop on just the first 100 lines of fff. The
> repeats took 0.006 seconds (impressive).
>
> Then just the first 300 lines. Timing was 0.016 seconds, approximately
> linear increase as expected
Are you sure a linear increase is expected?
> Then the first 500 lines. Timing is now 6.43 seconds. Something very
> odd there, that’s beyond exponential increase.
>
> And on the last 500 lines, timing was 0.135 seconds (Aha !!!)
I take it you tested this by doing the loop where fff was just line 1 to
300, then just line 1 to 500 and then just -500 to -1?
My guess here is that the first 300 lines do not have a unicode
character, there is somewhere in the next 200, and there are none in the
last 500.
> This would seem to point to matchChunk having indigestion over
> something in the middle of the text data. The data is 98% plain ascii,
> but quite likely it has a few Unicode characters: I have taken a whole
> lot of time to convert my legacy app to accept Asian and European
> Unicode personal names.
All regex matches in LC are Unicode (under the hood) - so thinking this
is regex related is a red-herring. Running a regex on Unicode text, is
no different from on native text - its just its dealing with 16-bit
units rather than 8-bit (regex is generally a linear operation - it
takes time proportional, roughly to length of pattern * length of string
- well, as long as you don't use any backtracking type features).
The issue here is the assumption that your code is doing something
linear...
It *looks* linear because your code is only doing two nested repeat
loops - so from the point of view of lines of script its iterating the
central loop roughly 'the number of lines of indexList * the number of
lines of fff' - however the engine is doing a lot more work.
The central loop is doing (paraphrased);
repeat with x = 1 to N
get line x of jjj
end repeat
This means the engine is searching for a line delimiter not N times, but
sum(1, ..., N) times (which is N*(N-1)/2 - i.e. N^2 roughly).
Processing 300 lines will take the engine about 45000 steps, processing
500 lines will take the engine 125000 steps, processing 1760 lines will
take the engine > 1,500,000.
This means that a small change in how long it takes to search for a line
delimiter (which is the fundamental operation here) makes a big
difference to how long doing 1000's of them will take - and searching a
string containing unicode is a fair bit slower than searching a string
which contains only native characters.
Fortunately there is an easy solution here - use an array so you get
random access to the lines of fff:
split fff by return
repeat with i=2 to the number of lines of indexList
put line (i+1) of indexList into which
put "(^[0-9-]*\t" & which & ")" into regX
put false into found
repeat with k=lastFound+1 to the number of elements of fff
put matchChunk(fff[k],regX,pos1,pos2) into found
if found then exit repeat
end repeat
if found then
put k into lastFound
end if
put lastFound into item i of mylineNumbers
end repeat
This should do exactly the same thing but SUBSTANTIALLY faster whether
your fff variable contains native or unicode characters.
Warmest Regards,
Mark.
--
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Build Amazing Things
More information about the use-livecode
mailing list