Something about speed
David Vaughan
dvk at dvkconsult.com.au
Sun Apr 25 04:16:51 EDT 2004
I have been engaged in an off-list conversation with Dar Scott, mostly
around his solution to the Yavelow test. In the course of it we found
out or re-confirmed a few things about Rev characteristics and I pass
them on here for anyone who is interested. The information comprises a
somewhat random collection.
On a file of decent size it is faster to strip all non-alphanumerics
(the set ".,!#$%&'()*+-/:;<=>?@[\^_`{|}~]" & quote) and use -replace-
than it is to use a single -matchText- on every word in a -repeat for
each- loop. The code would be something like:
constant unwanted=".,!#$%&'()*+-/:;<=>?@[\^_`{|}~]"
put quote & unwanted into stripThisLot
repeat for each char i in stripThisLot
replace i with space in whatever
end repeat
If you can limit to a subset (most text uses less than half that
punctuation) then of course you save some more time. Stripping
non-alphanumeric characters saves a lot of trouble with word handling
as otherwise constructions like "elsewhere!--Oh!" might confuse matters
by appearing as one word.
Handling items is about 20% faster than handling words. This advantage
may be lost where you have to account for empty items, whereas words
are never empty. For example, "elsewhere,,,Oh," vs "elsewhere Oh "
have five items and two words respectively. Three of the items are
empty.
The -replace- command (replace x with y in z) slows down with larger
data sets in a highly predictable way, defined as:
log(time) is a linear function of log(length(data))
To make the point, the log formula fits a regression of the
experimental data to a level of 99.96% on data of length 1000 to
300,000 characters. Note that in this instance the log formula is not
the same statement as "time is a linear function of length(data)".
Array insertion time is almost constant with array size, in arrays from
ten to a million unique keys. Therefore, it is clearly faster to work
on one large array than on multiple smaller arrays and then use union
at the end. Sometimes, breaking a job into smaller pieces slows it
down.
Stripping alphanumerics gives different answers. This may or may not be
what you want. For example, below is Dar's code with minor
modifications by me and some changes for consistency with Brian
Yennie's code. It runs about 20% slower than Brian's solution and
produces 15% more matches ;-). Not setting caseSensitive true would add
another match. It all depends on what you want.
regards
David
constant nonWords=".,!#$%&'()*+-/:;<=>?@[\^_`{|}~]" --Just so I
remember the full set
on mouseUp
local sTicks, sMS, tt, stList, unwanted, i, wordNumber, prevWord, w,
stL
local match, currentPos, numberMatches
put empty into field "TheResults"
set cursor to watch
put the milliseconds into sMS
lock screen
-- Using direct data as we already know the data load takes 18mS
put field "TargetText" into tt
put field "SearchTextList" into stList
set caseSensitive to true
-- Stripping this sub-set takes 6% of run time.
put quote & ".,!&'()*-:;?" into unwanted
repeat for each char i in unwanted
replace i with space in tt
replace i with space in stList
end repeat
-- Create set of single words and set of position sets for double
words
-- Following repeat loop takes just over 80% of run time
put 1 into wordNumber
put "$$$" into prevWord
repeat for each word w in tt
put true into boolA[w]
put wordNumber & space after ttA[prevWord,w]
add 1 to wordNumber
put w into prevWord
end repeat
-- Search
-- Just over 13% of run time
repeat for each line stLine in stList
put empty into stL
repeat for each word w in stLine
put w & space after stL
end repeat
if the number of words in stL = 1 then
put boolA[word 1 of stL] into match
else
repeat with i = 2 to the number of words in stL
put (word (i-1) of stL),(word i of stL) into w
if i=2 then
put ttA[w] into currentPos
else
-- This function absorbs 11 points of the 13%
put intersectPosSet(currentPos, ttA[w]) into currentPos
end if
put the number of words in currentPos into numberMatches
if numberMatches is 0 then exit repeat
end repeat
put numberMatches is not zero into match
end if
if match then
put stLine & cr after MatchList
end if
end repeat
put matchList into field "TheResults"
put timeit(sMS, "Check Matches") into sMS
-- Sort lines and counting them is outside the timing
sort lines of field "TheResults" ascending text
put (the number of lines of field "TheResults") & lf before field
"TheResults"
end mouseUp
-- modified intersection function that adds one to elements of the first
function intersectPosSet cPos ttPos
local newPos, p1, p2
put empty into newPos
put (word 1 of cPos)+1 into p1
put word 1 of ttPos into p2
repeat forever -- until (cPos is empty) or (ttPos is empty)
if p1 < p2 then
delete word 1 of cPos
if cPos is empty then return newPos
put (word 1 of cPos)+1 into p1
else
if p1 = p2 then
put p1 & space after newPos
delete word 1 of cPos
if cPos is empty then return newPos
delete word 1 of ttPos
if ttPos is empty then return newPos
put (word 1 of cPos)+1 into p1
put word 1 of ttPos into p2
else
delete word 1 of ttPos
if ttPos is empty then return newPos
put word 1 of ttPos into p2
end if
end if
end repeat
return newPos
end intersectPosSet
function timeit tMS,tMSG
get the milliseconds - tMS
put it && "ms:" && tMSG && "(" & round(it/16.67) && "ticks)" & return
before field "SpeedRecords"
return the milliseconds
end timeit
More information about the use-livecode
mailing list