Something about speed

David Vaughan dvk at dvkconsult.com.au
Sun Apr 25 04:16:51 EDT 2004


I have been engaged in an off-list conversation with Dar Scott, mostly 
around his solution to the Yavelow test. In the course of it we found 
out or re-confirmed a few things about Rev characteristics and I pass 
them on here for anyone who is interested. The information comprises a 
somewhat random collection.

On a file of decent size it is faster to strip all non-alphanumerics 
(the set ".,!#$%&'()*+-/:;<=>?@[\^_`{|}~]" & quote) and use -replace- 
than it is to use a single -matchText- on every word in a -repeat for 
each- loop. The code would be something like:

constant unwanted=".,!#$%&'()*+-/:;<=>?@[\^_`{|}~]"
put quote & unwanted into stripThisLot
repeat for each char i in stripThisLot
   replace i with space in whatever
end repeat

  If you can limit to a subset (most text uses less than half that 
punctuation) then of course you save some more time. Stripping 
non-alphanumeric characters saves a lot of trouble with word handling 
as otherwise constructions like "elsewhere!--Oh!" might confuse matters 
by appearing as one word.

Handling items is about 20% faster than handling words. This advantage 
may be lost where you have to account for empty items, whereas words 
are never empty. For example, "elsewhere,,,Oh," vs "elsewhere   Oh " 
have five items and two words respectively. Three of the items are 
empty.

The -replace- command (replace x with y in z) slows down with larger 
data sets in a highly predictable way, defined as:
log(time) is a linear function of log(length(data))
To make the point, the log formula fits a regression of the 
experimental data to a level of 99.96% on data of length 1000 to 
300,000 characters. Note that in this instance the log formula is not 
the same statement as "time is a linear function of length(data)".

Array insertion time is almost constant with array size, in arrays from 
ten to a million unique keys. Therefore, it is clearly faster to work 
on one large array than on multiple smaller arrays and then use union 
at the end. Sometimes, breaking a job into smaller pieces slows it 
down.

Stripping alphanumerics gives different answers. This may or may not be 
what you want. For example, below is Dar's code with minor 
modifications by me and some changes for consistency with Brian 
Yennie's code. It runs about 20% slower than Brian's solution and 
produces 15% more matches ;-). Not setting caseSensitive true would add 
another match. It all depends on what you want.

regards
David

constant nonWords=".,!#$%&'()*+-/:;<=>?@[\^_`{|}~]" --Just so I 
remember the full set

on mouseUp
   local sTicks, sMS, tt, stList, unwanted, i, wordNumber, prevWord, w, 
stL
   local match, currentPos, numberMatches

   put empty into field "TheResults"
   set cursor to watch
   put the milliseconds into sMS
   lock screen

   -- Using direct data as we already know the data load takes 18mS
   put field "TargetText" into tt
   put field "SearchTextList" into stList

   set caseSensitive to true
   -- Stripping this sub-set takes 6% of run time.
   put quote & ".,!&'()*-:;?" into unwanted
   repeat for each char i in unwanted
     replace i with space in tt
     replace i with space in stList
   end repeat
   -- Create set of single words and set of position sets for double 
words
   -- Following repeat loop takes just over 80% of run time
   put 1 into wordNumber
   put "$$$" into prevWord
   repeat for each word w in tt
     put true into boolA[w]
     put wordNumber & space after ttA[prevWord,w]
     add 1 to wordNumber
     put w into prevWord
   end repeat

   -- Search
   -- Just over 13% of run time
   repeat for each line stLine in stList
     put empty into stL
     repeat for each word w in stLine
       put w & space after stL
     end repeat
     if the number of words in stL = 1 then
       put boolA[word 1 of stL] into match
     else
       repeat with i = 2 to the number of words in stL
         put (word (i-1) of stL),(word i of stL) into w
         if i=2 then
           put ttA[w] into currentPos
         else
           -- This function absorbs 11 points of the 13%
           put intersectPosSet(currentPos, ttA[w]) into currentPos
         end if
         put the number of words in currentPos into numberMatches
         if numberMatches is 0 then exit repeat
       end repeat
       put numberMatches is not zero into match
     end if
     if match then
       put stLine & cr after MatchList
     end if
   end repeat
   put matchList into field "TheResults"
   put timeit(sMS, "Check Matches") into sMS
   -- Sort lines and counting them is outside the timing
   sort lines of field "TheResults" ascending text
   put (the number of lines of field "TheResults") & lf before field 
"TheResults"
end mouseUp

-- modified intersection function that adds one to elements of the first
function intersectPosSet cPos ttPos
   local newPos, p1, p2
   put empty into newPos
   put (word 1 of cPos)+1 into p1
   put word 1 of ttPos into p2
   repeat forever -- until (cPos is empty) or (ttPos is empty)
     if p1 < p2 then
       delete word 1 of cPos
       if cPos is empty then return newPos
       put (word 1 of cPos)+1 into p1
     else
       if p1 = p2 then
         put p1 & space after newPos
         delete word 1 of cPos
         if cPos is empty then return newPos
         delete word 1 of ttPos
         if ttPos is empty then return newPos
         put (word 1 of cPos)+1 into p1
         put word 1 of ttPos into p2
       else
         delete word 1 of ttPos
         if ttPos is empty then return newPos
         put word 1 of ttPos into p2
       end if
     end if
   end repeat
   return newPos
end intersectPosSet

function timeit tMS,tMSG
   get the milliseconds - tMS
   put it && "ms:" && tMSG && "(" & round(it/16.67) && "ticks)" & return 
before field "SpeedRecords"
   return the milliseconds
end timeit



More information about the use-livecode mailing list