Words Indexing strategies

Brian Yennie briany at qldlearning.com
Wed Feb 10 11:17:08 EST 2010


Alejandro,

The first step for this would likely include creating an "inverted" index. This means you store something like:

monkey:1,34,3827,21314

Where the word being indexed in "monkey" and the numbers that follow are article IDs. Using this information it is pretty trivial to implement AND / OR. Just merge the article IDs as needed. You can use a mix of this inverted index and your regular index in combination to do other types of queries. You can even store word frequency information to find similar articles.

With that said, I think you will have a very hard time crafting a competitive algorithm for this in Rev. A lot of the engines out there are very mature and there are many open source ones that you might consider bundling with your project instead of writing it yourself. Some things to watch out for:

1) File size. Indexing every word takes a lot of space, especially if you indices are in plain text and not a compact binary format
2) Common words. You might consider a "stop word" list, or a threshold. For example, if a word is in > 20% of all articles, don't index it. Or if it's in your "stop word" list.
3) Root words / plurals. Can you detect that "monkeys" is the plural of "monkey" (or more complex cases)?

An example of an open source engine that is very mature is Lucene, which can be run from the command-line in Java.

> Hi Bernard,
> 
> on Wed, 10 Feb 2010 07:20:36 +0000
> Bernard Devlin wrote:
>> Can I just clarify your problem?  You want to be able to search for
>> phrases (partial sentences, possibly with boolean logic) inside the
>> text stored in the xml nodes of the article, once the article is found
>> in the index?
> 
> No, it's not a search inside the displayed article.
> It's a global search, within a general index created
> using all words from all articles of Wikipedia.
> (I do not believe that it's necessary to load this full
> index in memory, instead just open specific parts
> of this index when users start searching)
> 
> For this reason, i am looking for advice to create an
> index structure that allows to implement a fast search
> algorithm, using multiple words (and boolean logic, if
> possible), similar to Wikipedia's own search engine or
> (better yet) just like google. :-)
> 
> Many thanks for your interest in this question!
> 
> Alejandro
> _______________________________________________
> use-revolution mailing list
> use-revolution at lists.runrev.com
> Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
> http://lists.runrev.com/mailman/listinfo/use-revolutio



More information about the use-livecode mailing list