Trying to code the fastest algorithm...
Alex Tweedly
alex at tweedly.net
Sun Jun 5 15:01:12 EDT 2005
jbv wrote:
>Hi list,
>
>I'm trying to build the fastest possible algorithm for the
>following task :
>I have a variable containing reference words; each "word"
>in this variable is considered as an item.
>I also have a list of "sentences" (each sentence being a
>list of "words" separated by spaces), 1 per line. Each
>sentence can contain 1 or more words from the reference,
>as well as other words.
>
>I need to determine for each sentence, which words from
>the reference are included in it, in which order, and output
>a list of offsets.
>For example :
> reference variable : W1,W2,W3,W4
> sentence 1 : Wx W2 Wy Wz W3
> output : 2,3
>
>And last but not least, I need to keep only sentences that
>contain more than 1 word from the reference.
>
>Here's the fastest script I managed to code :
>
>put "" into newList
>repeat for each line j in mySentences
> put j into a
> replace " " with itemdelimiter in a
> put "" into b
>
> repeat for each item i in it
> get itemoffset(i,myReference)
> if it>0 then
> put it & itemdelimiter after b
> end if
> end repeat
>
Shouldn't that be "repeat for each item i in a" rather than "... in it",
since "it" gets overwritten within the loop?
> if b contains itemdelimiter then
> delete last char of b
> put b & cr after newList
> end if
> end repeat
>
>But I was wondering if there was any faster approach...
>
>
I don't see anything here which "keeps only those sentences containing
more than 1 word" - the "if b contains itemdelimiter" test passes even
with a single hit, since every offset is followed by a delimiter.
And newList just contains the lists of offsets, with no indication of
which sentence each one occurred in?
Do you have a "typical" dataset in mind, or know any characteristics of it?
- how many words in the reference list?
- how many sentences, and how long is a typical one?
- how many "hits" vs "misses"?
Since you're asking for the "fastest" algorithm, I assume at least one of
these numbers is large - and it's unlikely the same algorithm would
excel at both ends of the spectrum.
I have 2 or 3 approaches in mind; any clues you have on the
characteristics of the data would help decide which ones to pursue.
e.g. build an associative array of the words which occur in myReference,
where the content of each array element is the word's position within the
reference, and look up each word of the sentence in that array. A single
array lookup should be faster than "itemOffset" scanning the whole
reference, but probably not by enough to justify it if the number of
words in myReference is very small.
--
Alex Tweedly http://www.tweedly.net
--
More information about the use-livecode mailing list