Repeat For Each Iteration Order (was Re: use-livecode Digest, Vol 169, Issue 18)
Mark Waddingham
mark at livecode.com
Thu Oct 12 13:16:02 EDT 2017
On 2017-10-12 18:38, Peter Reid via use-livecode wrote:
> I agree that the redundant indexing is an expensive approach, however
> I have found that this abnormal structure of a repeat for each loop
> can be a lot faster than the other loop forms in some circumstances.
>
> I can't find the reference that first highlighted to me the lack of
> guaranteed sequence of chunks, but I've assumed this was a restriction
> for some time. It would be great if this is incorrect as I've got some
> heavyweight looping in several of my apps that would benefit from
> this!
>
> Can anyone from LiveCode give a definitive answer to this please?
Yes - the previous posts are correct.
If you are using 'repeat for each' to iterate over something which has
an order, then the iteration will follow that order.
Specifically:
- string chunks (char, item, line, word, trueWord, sentence,
paragraph, token, codepoint, codeunit) are all ordered
- byte chunks are ordered
- key chunks are unordered (they use the hash-order of the array,
which can change arbitrarily even within a session)
- element chunks are ordered for arrays which are sequences*,
unordered for all other arrays
* An array is a sequence if: it has only integer keys, and keys range
from 1 up to the number of elements in the array.
e.g. If the keys are 1, 2, 3, 4 - then that is a sequence; 0, 1, 2, 3
is not; 1, 3, 4 is not.
Just to reiterate something Richard mentioned - the performance
advantage of a 'repeat for each' loop is completely lost if you use a
normal chunk within the loop to fetch another part (or the same part!)
of the container:
put 0 into tIndex
repeat for each word tWord in tString
add 1 to tIndex
put word somefunc(tIndex) of tString into tSameWord
...
end repeat
This is because the chunk evaluation must step through somefunc(tIndex)
words in the string *again*. (It takes N steps to access the N'th
string/byte chunk in a container - repeat for each gets its performance
advantage because it remembers where it got the previous chunk from so
it can fetch the next, so it is just 1 step each iteration).
The only types of access which are guaranteed not to have this N step
behavior are codeunit, byte and array accesses:
- a LiveCode string is (internally) an array of codeunits; which means
that it takes only one step to get any codeunit in it.
- a LiveCode binary string is (internally) an array of bytes; so it
only takes one step to get any byte from it
- LiveCode arrays are hash-tables, so you can look up any key in one
step
Note: If you are using the 'byte' chunk, make sure that the thing they
are accessing *is* actually a binary string before using it on them. The
conversion from string -> binary string (which exists for compatibility
with pre-7 scripts) will cause a performance bump.
In terms of the codeunit/codepoint/character chunks - the engine *tries*
to ensure it does as little work as possible when accessing these.
Internally, the engine sets flags on a string so that it can optimize
these lookups. In particular:
1) If a string contains only Unicode codepoints from the first 64k
Unicode codes *without* surrogates (they give access to everything above
64k) then codepoint access on that string will be 1 step.
2) If a string contains characters which are only composed of single
codepoints in the first 64k Unicode code *without* surrogates, then char
access on that string will be 1 step.
In particular, if the string you are processing actually came from a
native string; then (1) and (2) are guaranteed to be the case. However,
if you have arbitrary Unicode strings, then it won't generally be the
case:
a) any characters which have a Unicode code > 64k must be represented
by two codes < 64k (surrogate pairs); this means that the engine has to
step through the string codeunit by codeunit to count the codepoints so
it can find the one you asked for (N step again)
b) any characters which are composed of multiple codepoints are in the
same boat
Case (b) can arise quite often even in Roman script languages. For
example, in Unicode you can write e-acute as EITHER e,combining-acute
(two codepoints) OR as e-with-acute (one codepoint). For some languages,
most characters will be multiple codepoints - e.g. the Indic languages.
There are also a lot of esoteric languages (for scholarly / academic
purposes) which are entirely defined by Unicode codepoints > 64k.
Upshot is that if you are doing char by char string processing of
general Unicode strings, then try and use repeat for each char; and if
you *can't* do that for some reason, then try and work out how you can
do the processing by using the codeunit chunk.
However, as Knuth is famed for saying - "Premature optimization is the
root of all evil" - don't spend time trying to optimize an algorithm
until you have a working version and can intimately analyse where the
slow parts are. Trying to optimize ahead of time will likely just cause
a lot of time being spent needlessly, and potential hair loss when
figuring out why the (probably much more complicated code) doesn't work
quite right, or why it didn't produce the speed improvement you were
expecting!
Warmest Regards,
Mark.
--
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps
More information about the use-livecode
mailing list