SC, Rev, and RB speed test
David Vaughan
dvk at dvkconsult.com.au
Sat Apr 17 08:23:16 EDT 2004
On 17/04/2004, at 21:46, Brian Yennie <briany at qldlearning.com> wrote:
> It's amazing how many different optimizations we can find based on the
> type of data. There's the baseline brute force, there's mine which
> takes a constant factor out (by shrinking the text), but would fail if
> you had many different words throughout with the same suffixes (as odd
> as that may seem). There were a couple of methods that fly through
> redundant text. And there's your latest which really whoops it up on
> large texts with certain length search strings.
>
> I wonder on yours, could there be some advantage to be gleaned from
> the fact that longer searches can be broken down into shorter
> sub-searches (i.e. 6 words = 3 words + 3 words)? Could be interesting
> if there's a slick way to utilize that.
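[Interjecting with a sketch of that sub-search idea. Python rather than Transcript, purely for a self-contained illustration; all names here are hypothetical. The idea: index every 3-word window by its starting position, then a 6-word query is the positions matching its first three words intersected with the positions matching its last three, shifted by 3.]

```python
from collections import defaultdict

def build_index(words, n=3):
    """Map each n-word window to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(words) - n + 1):
        index[tuple(words[i:i + n])].append(i)
    return index

def find_six(index, query_words):
    """Answer a 6-word query as two 3-word sub-searches:
    a hit at position i for the first half must pair with a
    hit at position i + 3 for the second half."""
    first = set(index.get(tuple(query_words[:3]), []))
    second = set(index.get(tuple(query_words[3:6]), []))
    return sorted(i for i in first if i + 3 in second)

text = "the quick brown fox jumps over the lazy dog".split()
idx = build_index(text)
print(find_six(idx, "the quick brown fox jumps over".split()))  # → [0]
```

[The same index answers both 3-word and 6-word queries, which is where the advantage might come from.]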
>
> From what I can tell, the algorithms roughly look like:
>
> Brute force: ~500 ticks Dual-G4 baseline, runs faster when there are
> more matches
> Redundancy array methods: As little as a few ticks, but slower on
> pseudo-random data
> Suffix indexing method: about 60-80% faster at eliminating matches and
> 20-40% slower at finding matches than brute force
> 3-word indexing method: About 10% overhead with 99% speedup on 3 word
> queries, same speed on others
and also
>
> One interesting thing to note here... many of the algorithms using
> array indexing schemes give different results than the "brute" force.
> They tend to miss compound "words", such as "I like to use
> dashes--like this". Search for "I like to use dashes" and you may find
> that "dashes--like" was thought of as a word. Same problem with
> quotes.
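[The compound-word mismatch described above is easy to reproduce in a quick sketch. Python here for brevity, names illustrative: a whitespace tokenizer keeps "dashes--like" as a single token, so a word-indexed search misses "dashes" unless punctuation is stripped to spaces first.]

```python
import re

text = "I like to use dashes--like this"

# Whitespace tokenization, as the array-indexing methods do:
whitespace_tokens = text.split()

print("dashes--like" in whitespace_tokens)   # → True
print("dashes" in whitespace_tokens)         # → False

# Replacing punctuation with spaces before tokenizing makes the
# word index agree with a brute-force substring search:
cleaned = re.sub(r"[^\w\s]", " ", text).split()
print("dashes" in cleaned)                   # → True
```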
Brian,
I liked your token method, an approach I had not considered. I have not
tackled the question of re-grouping the search text because at this
stage I would need yet more information about Chris' intended
application. I have assumed he is finding characteristic triplets in
text to use in author-identification or to compare classes or styles of
literature or some such. In these circumstances I doubt that
punctuation matters. The only reason I stripped quotes in my version is
that I noticed unbalanced quotes in the search text, but in a real
application I would probably do it more comprehensively with
something like:
repeat with i = 33 to 47
  replace numToChar(i) with space in whateverText
end repeat -- and so on
Given that Chris' original SuperCard app takes over 15 seconds on the
sample text while we now have it under a second, I suspect we can
afford a few more bits of work to get the right technical result. I
would like to see all of the prime strategies we have devised published
on Chris Yavelow's web site, to prove the richness and versatility of
RunRev as well as its speed. As Richard Gaskin said early on, there is
more to an app than one speed test.
regards
David
More information about the use-livecode mailing list