SC, Rev,and RB speed test
Brian Yennie
briany at qldlearning.com
Sat Apr 17 07:42:08 EDT 2004
One interesting thing to note here... many of the algorithms using
array indexing schemes give different results than the "brute" force.
They tend to miss compound "words", such as "I like to use dashes--like
this". Search for "I like to use dashes" and you may find that
"dashes--like" was thought of as a word. Same problem with quotes. On
the other hand, doing a straight offset() search has the opposite
problem- partial word matching. And thus you get something like "I am
afraid I" giving a positive match for "I am afraid it".
Here's one more interesting script. It has a large amount of overhead,
but it still beats brute force and uses tokens to get a pretty accurate
yet flexible match:
By the nature of tokens, it will allow for extra spaces in the query,
or missing spaces after punctuation. For example:
"by Jane Austen ( 1811 )"
matches
"by Jane Austen (1811)"
(and vice versa)
And
"Their estate was large, and"
matches
"Their estate was large,and"
(and vice versa)
But
"am afraid I"
does not match
"am afraid it"
This one only runs about 25% faster than brute force on the sample
data, and has a fair amount of indexing overhead, but it's interesting!
Note that switching the same script to use words instead of tokens
loses some of the matching ability but grabs you a full 50% speedup
over brute force. This algorithm should work equally well for a lot of
different data sets, with the caveat that it takes a lot of overhead up
front on large texts.
I also discovered one interesting nuggets- if you use "token" as a
chunk type it will actually skip of code comments for you! So I suppose
if you want to hide mysterious messages in the text...
on mouseUp
local
stList,theText,MatchList,sMS,sTicks,w,index,indexN,wordNum,saveLine,star
tWord,wordPos,isMatch
put fld "TargetText" into theText
put "" into fld "TheResults"
put field "SearchTextList" into stList
put the milliseconds into sMS
put ticks() into sTicks
put 1 into wordNum
replace quote with empty in theText
replace "--" with empty in theText
replace "##" with empty in theText
replace "/*" with empty in theText
replace quote with empty in stList
replace "--" with empty in stList
replace "##" with empty in stList
replace "/*" with empty in stList
repeat for each token w in theText
put wordNum&comma after index[w]
put w into indexN[wordNum]
add 1 to wordNum
end repeat
repeat for each line stLine in stList
put stLine into saveLine
put token 1 of stLine into startWord
delete token 1 of stLine
repeat for each item wordPos in index[startWord]
put 1 into isMatch
repeat for each token w in stLine
add 1 to wordPos
if (indexN[wordPos] <> w) then
put 0 into isMatch
exit repeat
end if
end repeat
if (isMatch = 1) then exit repeat
end repeat
if (isMatch = 1) then
put saveLine & cr after MatchList
end if
end repeat
put MatchList into fld "TheResults"
put the milliseconds - sMS && "ms" & return && "(" & ticks()-sticks
&& "ticks)" & return before cd fld "SpeedRecords"
end mouseUp
- Brian
More information about the use-livecode
mailing list