Words Indexing strategies

Richard Gaskin ambassador at fourthworld.com
Wed Feb 10 11:26:30 EST 2010


The ambitious Alejandro Tejada wrote:

> It's a global search, within a general index created
> using all words from all articles of Wikipedia.
> (I do not believe that it's necessary to load this full
> index in memory, instead just open specific parts
> of this index when users start searching)
>
> For this reason, i am looking for advice to create an
> index structure that allows to implement a fast search
> algorithm, using multiple words (and boolean logic, if
> possible), similar to Wikipedia's own search engine or
> (better yet) just like google. :-)

A good place to start on that is the seminal paper describing the 
initial Google implementation, written by the founders:

The Anatomy of a Large-Scale Hypertextual Web Search Engine
Sergey Brin and Lawrence Page
<http://infolab.stanford.edu/~backrub/google.html>

But be warned:  indexing is a deep topic, and may become a consuming 
passion.  Roulette is rumored to have been invented by a monk who came 
to believe he could find a way to predict its outcomes, and eventually 
went mad trying.  Indexing is a bit like that. :)

A couple of the longer-term projects I work on need to incorporate good 
indexing of large corpuses, and my own code to that end has advanced 
only in small baby steps as I learn more about it.

So while I have little in the way of applied code to share at the 
moment, I can offer a few theoretical pointers:

 From what I'm reading, my first advice would be to not bother unless 
you absolutely need to.  If there's any way you can use an existing 
index you'll be a happier man to do so.

But if you have to make your own index, you may find it helpful (or 
maddening) to consider the challenges involved with the various tenses 
and inflections of words, and how to determine the root word in its 
native form (the "lemma") for your lookups.

Currently I'm looking into the various lemmatization schemes available, 
since it can help tremendously to both keep the index small enough to be 
practical while returning more relevant search results.

Lemmatization attempts this through linguistic rules; stemming could be 
said to be a form of "cheating" by using simpler algos less dependent on 
the nuances of a given language to attempt the same result.  The 
differences are explained well here:

Stemming and lemmatization
<http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html>

Links to specific stemming algos are at the bottom of this article:
<http://www.comp.lancs.ac.uk/computing/research/stemming/general/>

Without lemmatization or stemming, searches for "children" will fail to 
find "child", which could well be relevant to the searcher.  And some 
stemming methods won't be able to transform "children" to "child" since 
it's an uncommon transformation linguistically, much less so than more 
common plural forms like just adding "s" or "es" to the end.

There's a wealth of info available searching the web for "indexing 
algorithms", "stemming algorithms", etc.  Doing it well may take a 
lifetime; fortunately it seems there are some "cheating" methods which 
will do most of the job well enough in less time.


All that said, I have to wonder: if Wikipedia's content is available, 
isn't their index also available?

Porting it from MySQL to SQLite seems a far less daunting task than 
writing an index from scratch.

--
  Richard Gaskin
  Fourth World
  Rev training and consulting: http://www.fourthworld.com
  Webzine for Rev developers: http://www.revjournal.com
  revJournal blog: http://revjournal.com/blog.irv



More information about the use-livecode mailing list