Trying to code the fastest algorithm...

Jim Ault JimAultWins at yahoo.com
Sun Jun 5 16:52:57 EDT 2005


JB
>I'm trying to build the fastest possible algorithm for the
>following task:
>I have a variable containing reference words; each "word"
>in this variable is considered as an item.
>I also have a list of "sentences" (each sentence being a
>list of "words" separated by spaces), 1 per line. Each
>sentence can contain 1 or more words from the reference,
>as well as other words.
>
>I need to determine for each sentence, which words from
>the reference are included in it, in which order, and output
>a list of offsets.
>For example :
>    reference variable :  W1,W2,W3,W4
>    sentence 1 : Wx W2 Wy Wz W3
>    output : 2,3
>
>And last but not least, I need to keep only sentences that
>contain more than 1 word from the reference.
Also:
> the set of reference words is unpredictable. It is actually derived from a
> sentence entered by the end user, and after some processing / selection
> a set of words is extracted from that sentence and then compared with other
> sentences in a database...

Consider conditioning your task so that the list of sentences in the
hit-detection variable is sorted by length, shortest to longest, then
reporting back intermediate results for the user to begin evaluating (e.g.
every 50 lines).  This would make the perceived wait time virtually zero.
Show a progress indicator.
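
A minimal sketch of that idea (the batch size of 50 and a field named
"progress" are just assumptions):

sort lines of mySentences numeric by the length of each
put 0 into tDone
repeat for each line tSentence in mySentences
  -- ...run the hit detection on tSentence here...
  add 1 to tDone
  if tDone mod 50 = 0 then
    put tDone && "of" && the number of lines of mySentences into field "progress"
    wait 0 milliseconds with messages  -- lets the field actually redraw mid-loop
  end if
end repeat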

Another approach, since you are allowing the user to enter hit words,
including misspellings, is to test each user-word to see if it "is in
tTheWholeList" and report a true/false for each word before doing the
counts, rather than bothering to locate it on each line.  This would be
very fast if no words qualified, and you could ask the user "do you want to
continue?" before actually doing the search.  That would save the user the
aggravation of waiting for a bad entry to finish.
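
A rough sketch of that pre-check, assuming tTheWholeList holds all the
sentences as one block (note "is in" is a plain substring test, so a word
buried inside a longer word still counts as found):

put empty into tMissing
repeat for each item tWord in myReference
  if tWord is not in tTheWholeList then put tWord & comma after tMissing
end repeat
if tMissing is not empty then
  delete the last char of tMissing
  answer "Not found anywhere:" && tMissing & return & "Search anyway?" with "Yes" or "No"
  if it is "No" then exit to top
end if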

Along the same lines, you could break your sentences into arbitrary groups
and test each group first: "is it in gSectionAglobal"? No, skip.  "Is it in
gSectionBglobal"? Yes, do the work.  "Is it in gSectionCglobal"? No, skip...

Are you considering a proximity parameter later?  That would mean the words
must occur within a certain 'neighborhood', or number of chars, of each
other.  If so, any sentence shorter than the proximity distance would not
even have to be evaluated for it.
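
A cheap guard for that case might look like this (tProximity and the
checkProximity function are assumptions, not anything in your code):

if the length of tSentence < tProximity then
  put true into tPassesProximity  -- whole sentence fits inside the neighborhood
else
  put checkProximity(tSentence, myReference, tProximity) into tPassesProximity
end if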

If the database of sentences is static, think about pre-indexing the lines,
then just locating the list of line numbers for each reference word and
reporting any line number that "is in" the entries of at least one of the
other reference words.  Adding new lines would only mean updating the
index, not rebuilding it.

The index lookup could be lineOffset("meal",indexx)  -- with the wholeMatches
set to true.  Each line of the index is a word followed by the numbers of
the lines that contain it:
fortnight,22,95
meal,12,13,14,26,32,45,78,299
meat,3,26,44,78
meet,26,34,99
meeting,2,4,1099
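
One way to build such an index is with an array keyed by word, flattened
afterwards into the layout above (a sketch; every variable name except
indexx is made up):

put 0 into tLineNum
repeat for each line tSentence in mySentences
  add 1 to tLineNum
  repeat for each word tWord in tSentence
    put comma & tLineNum after tIndexArray[tWord]
    -- (a word repeated within one sentence will list its line twice; harmless here)
  end repeat
end repeat
put empty into indexx
repeat for each key tWord in tIndexArray
  put tWord & tIndexArray[tWord] & cr after indexx
end repeat
-- adding a new sentence later only means running the inner loop for that one line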


Just some thoughts.  Are Booleans coming later, e.g. W1 and (W2 or W5)?

Jim Ault
Las Vegas

On 6/5/05 12:47 PM, "jbv" <jbv.silences at Club-Internet.fr> wrote:

> 
> 
> Alex,
> 
>>>    repeat for each item i in it
>>>        get itemoffset(i,myReference)
>>>        if it>0 then
>>>            put it & itemdelimiter after b
>>>        end if
>>>    end repeat
>>> 
>>> 
>>> 
>> Shouldn't that be "repeat for each item i in a" rather than "... in it",
>> since "it" gets over-written within the loop ?
> 
> yes, you're right. That was a typo.
> Actually my script is slightly more complex, and I had to strip
> off a couple of things to make the problem easier to explain...
> hence the typo...
> 
> So here's a clean version of my code:
> 
> put "" into newList
> repeat for each line j in mySentences
>     put j into a
>     replace " " with itemdelimiter in a
>     put "" into b
> 
>     repeat for each item i in a
>         get itemoffset(i,myReference)
>         if it>0 then
>             put it & itemdelimiter after b
>         end if
>     end repeat
>     delete last char in b
> 
>     if b contains itemdelimiter then
>         put b & cr after newList
>     end if
>  end repeat
> 
>> 
>> I don't see anything here which "keeps only those sentences containing
>> more than 1 word" ?
> 
> it's the test:
>     if b contains itemdelimiter then
> which runs about 25% to 30% faster than
>     if number of items of b > 1 then
> 
>> 
>> 
>> Do you have a "typical" dataset in mind ?
>> or know any characteristics of it ?
> 
> as I said, only sentences, in French or English, with a maximum
> of 30 words.
> 
>> 
>> how many words in the reference list ?
> 
> can range between 1 and 25.
> 
>> 
>> how many sentences, and how long is a typical one ?
> 
> can range from 1 to several hundred (hence the need for speed).
> 
>> 
>> how many "hits" vs "misses" ?
> 
> unpredictable...
> 
>> 
>> 
>> Since you're asking for "fastest" algorithm, I assume at least one of
>> these numbers is large - and it's unlikely the same algorithm would
>> excel at both ends of the spectrum .....
>> 
>> I have 2 or 3 approaches in mind ... any clues you have on the
>> characteristics of the data would help decide which ones to pursue .....
>> e.g.   build an assoc array of the words which occur in myReferences,
>> where the content of the array element is the word number within the
>> references, and lookup each word in the sentence with that ... should be
>> faster than "itemOffset" but probably not enough to justify it if the
>> number of words in myReferences is very small.
>> 
> 
> The "array" approach crossed my mind, but I'm afraid it's not feasable :
> the set of reference words is unpredictable. It is actually derived from a
> sentence entered by the end user, and after some processing / selection
> a set words is extracted from that sentence and then compared with other
> sentences in a data base...
> 
> Thanks for your time & collaboration anyway.
> JB
> 
> _______________________________________________
> use-revolution mailing list
> use-revolution at lists.runrev.com
> http://lists.runrev.com/mailman/listinfo/use-revolution



