Trying to code the fastest algorithm...

Alex Tweedly alex at tweedly.net
Sun Jun 5 15:01:12 EDT 2005


jbv wrote:

>Hi list,
>
>I'm trying to build the fastest possible algorithm for the
>following task :
>I have a variable containing reference words; each "word"
>in this variable is considered as an item.
>I also have a list of "sentences" (each sentence being a
>list of "words" separated by spaces), 1 per line. Each
>sentence can contain 1 or more words from the reference,
>as well as other words.
>
>I need to determine for each sentence, which words from
>the reference are included in it, in which order, and output
>a list of offsets.
>For example :
>    reference variable :  W1,W2,W3,W4
>    sentence 1 : Wx W2 Wy Wz W3
>    output : 2,3
>
>And last but not least, I need to keep only sentences that
>contain more than 1 word from the reference.
>
>Here's the fastest script I managed to code :
>
>put "" into newList
>repeat for each line j in mySentences
>    put j into a
>    replace " " with itemdelimiter in a
>    put "" into b
>
>    repeat for each item i in it
>        get itemoffset(i,myReference)
>        if it>0 then
>            put it & itemdelimiter after b
>        end if
>    end repeat
>
>  
>
Shouldn't that be "repeat for each item i in a" rather than "... in it",
since "it" gets overwritten by the itemoffset call inside the loop?

>    if b contains itemdelimiter then
>        delete last char of b
>        put b & cr after newList
>    end if
> end repeat
>
>But I was wondering if there was any faster approach...
>  
>
I don't see anything here which "keeps only those sentences containing 
more than 1 word" - a single hit still leaves an itemdelimiter in b, so 
the "b contains itemdelimiter" test passes with just one match.
And newList just contains the lists of offsets - with no indication of 
which sentence they occurred in?
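
For what it's worth, here's roughly how I'd patch those two points - 
untested sketch, and it assumes you want the sentence's line number as 
the first item of each output line (adjust to taste):

```livecode
put "" into newList
put the itemdelimiter into d
put 0 into lineNum
repeat for each line j in mySentences
    add 1 to lineNum
    put j into a
    replace " " with d in a
    put "" into b
    put 0 into hitCount
    repeat for each item i in a   -- "a", not "it"
        get itemoffset(i, myReference)
        if it > 0 then
            put it & d after b
            add 1 to hitCount
        end if
    end repeat
    if hitCount > 1 then  -- more than one reference word, as specified
        delete last char of b
        put lineNum & d & b & cr after newList
    end if
end repeat
```

The explicit hitCount counter is what fixes the "more than 1 word" 
filter; counting is also a touch cheaper than re-scanning b.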

Do you have a "typical" dataset in mind, or know any characteristics of it?
How many words in the reference list?
How many sentences, and how long is a typical one?
How many "hits" vs "misses"?

Since you're asking for the "fastest" algorithm, I assume at least one of 
these numbers is large - and it's unlikely the same algorithm would 
excel at both ends of the spectrum.

I have 2 or 3 approaches in mind; any clues you have on the 
characteristics of the data would help decide which ones to pursue.
E.g. build an associative array keyed by the words in myReference, where 
each element's content is that word's item number within the reference, 
and look each sentence word up in that array. That should be faster than 
"itemOffset", but probably not by enough to justify it if the number of 
words in myReference is very small.
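
Something like this (untested sketch; it assumes the words in 
myReference are unique, and it iterates the sentence by word, so there's 
no need to replace spaces with itemdelimiter first):

```livecode
-- build the lookup array once, before the main loop
put 0 into n
repeat for each item w in myReference
    add 1 to n
    put n into refArray[w]
end repeat

-- then, inside the per-sentence loop, instead of the itemoffset call:
repeat for each word w in j
    if refArray[w] is not empty then
        put refArray[w] & comma after b
    end if
end repeat
```

Array lookup is roughly constant-time per word, where itemOffset scans 
the whole reference list each time - so the win grows with the size of 
myReference.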

-- 
Alex Tweedly       http://www.tweedly.net





