SC, Rev,and RB speed test

Brian Yennie briany at qldlearning.com
Sat Apr 17 07:42:08 EDT 2004


One interesting thing to note here... many of the algorithms using  
array indexing schemes give different results than the "brute" force.  
They tend to miss compound "words", such as "I like to use dashes--like  
this". Search for "I like to use dashes" and you may find that  
"dashes--like" was thought of as a word. Same problem with quotes. On  
the other hand, doing a straight offset() search has the opposite  
problem- partial word matching. And thus you get something like "I am  
afraid I" giving a positive match for "I am afraid it".

Here's one more interesting script. It has a large amount of overhead,  
but it still beats brute force and uses tokens to get a pretty accurate  
yet flexible match:

By the nature of tokens, it will allow for extra spaces in the query,  
or missing spaces after punctuation. For example:

"by Jane Austen ( 1811 )"
matches
"by Jane Austen (1811)"
(and vice versa)

And
"Their estate was large, and"
matches
"Their estate was large,and"
(and vice versa)

But
"am afraid I"
does not match
"am afraid it"

This one only runs about 25% faster than brute force on the sample  
data, and has a fair amount of indexing overhead, but it's interesting!  
Note that switching the same script to use words instead of tokens  
loses some of the matching ability but grabs you a full 50% speedup  
over brute force. This algorithm should work equally well for a lot of  
different data sets, with the caveat that it takes a lot of overhead up  
front on large texts.

I also discovered one interesting nuggets- if you use "token" as a  
chunk type it will actually skip of code comments for you! So I suppose  
if you want to hide mysterious messages in the text...

on mouseUp
   local  
stList,theText,MatchList,sMS,sTicks,w,index,indexN,wordNum,saveLine,star 
tWord,wordPos,isMatch

   put fld "TargetText" into theText
   put "" into fld "TheResults"
   put field "SearchTextList" into stList
   put the milliseconds into sMS
   put ticks() into sTicks
   put 1 into wordNum

   replace quote with empty in theText
   replace "--" with empty in theText
   replace "##" with empty in theText
   replace "/*" with empty in theText

   replace quote with empty in stList
   replace "--" with empty in stList
   replace "##" with empty in stList
   replace "/*" with empty in stList

   repeat for each token w in theText
     put wordNum&comma after index[w]
     put w into indexN[wordNum]
     add 1 to wordNum
   end repeat

   repeat for each line stLine in stList
     put stLine into saveLine
     put token 1 of stLine into startWord
     delete token 1 of stLine
     repeat for each item wordPos in index[startWord]
       put 1 into isMatch
       repeat for each token w in stLine
         add 1 to wordPos
         if (indexN[wordPos] <> w) then
           put 0 into isMatch
           exit repeat
         end if
       end repeat
       if (isMatch = 1) then exit repeat
     end repeat
     if (isMatch = 1) then
       put saveLine & cr after MatchList
     end if
   end repeat

   put MatchList into fld "TheResults"
   put the milliseconds - sMS && "ms" & return && "(" & ticks()-sticks  
&& "ticks)" & return before cd fld "SpeedRecords"

end mouseUp



- Brian



More information about the use-livecode mailing list