SC, Rev, and RB speed test

David Vaughan dvk at dvkconsult.com.au
Sat Apr 17 08:23:16 EDT 2004


On 17/04/2004, at 21:46, Brian Yennie <briany at qldlearning.com> wrote:

> It's amazing how many different optimizations we can find based on the 
> type of data. There's the baseline brute force, there's mine which 
> takes a constant factor out (by shrinking the text), but would fail if 
> you had many different words throughout with the same suffixes (as odd 
> as that may seem). There were a couple of methods that fly through 
> redundant text. And there's your latest which really whoops it up on 
> large texts with certain length search strings.
>
> I wonder on yours, could there be some advantage to be gleaned from 
> the fact that longer searches can be broken down into shorter 
> sub-searches (i.e. 6 words = 3 words + 3 words)? Could be interesting 
> if there's a slick way to utilize that.
>
> From what I can tell, the algorithms roughly look like:
>
> Brute force: ~500 ticks Dual-G4 baseline, runs faster when there are 
> more matches
> Redundancy array methods: As little as a few ticks, but slower on 
> pseudo-random data
> Suffix indexing method: about 60-80% faster at eliminating matches and 
> 20-40% slower at finding matches than brute force
> 3-word indexing method: About 10% overhead with 99% speedup on 3 word 
> queries, same speed on others

and also
>
> One interesting thing to note here... many of the algorithms using 
> array indexing schemes give different results than the "brute" force. 
> They tend to miss compound "words", such as "I like to use 
> dashes--like this". Search for "I like to use dashes" and you may find 
> that "dashes--like" was thought of as a word. Same problem with 
> quotes.

Brian

Liked your token method, an approach I had not considered. I have not 
tackled the question of re-grouping the search text because at this 
stage I would need yet more information about Chris' intended 
application. I have assumed he is finding characteristic triplets in 
text to use in author-identification or to compare classes or styles of 
literature or some such. In these circumstances I doubt that 
punctuation matters. The only reason I stripped quotes in my version is 
that I noticed unbalanced quotes in the search text, but in a real 
application I would probably do it more comprehensively with 
something like:
repeat with i = 33 to 47
   -- ASCII 33 to 47 covers ! " # $ % & ' ( ) * + , - . /
   replace numToChar(i) with space in whateverText
end repeat -- and so on for the other punctuation ranges
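
Incidentally, stripping punctuation this way should also cure the 
compound-word problem Brian mentions, since Rev's word chunks are 
space-delimited. A quick sketch (tText is just an illustrative 
variable of my own):

put "I like to use dashes--like this" into tText
put word 5 of tText -- "dashes--like" counts as one word
repeat with i = 33 to 47
   replace numToChar(i) with space in tText
end repeat
put word 5 of tText -- now simply "dashes"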

Given that Chris' original SuperCard app takes over 15 seconds on the 
sample text while we now have it under a second, I suspect we can 
afford a few more bits of work to get the right technical result. I 
would like to see all of the prime strategies we have devised published 
on Chris Yavelow's web site, to prove the richness and versatility of 
RunRev as well as its speed. As Richard Gaskin said early on, there is 
more to an app than one speed test.

regards
David
