Words Indexing strategies
Richard Gaskin
ambassador at fourthworld.com
Wed Feb 10 11:26:30 EST 2010
The ambitious Alejandro Tejada wrote:
> It's a global search, within a general index created
> using all words from all articles of Wikipedia.
> (I do not believe that it's necessary to load this full
> index in memory, instead just open specific parts
> of this index when users start searching)
>
> For this reason, i am looking for advice to create an
> index structure that allows to implement a fast search
> algorithm, using multiple words (and boolean logic, if
> possible), similar to Wikipedia's own search engine or
> (better yet) just like google. :-)
A good place to start on that is the seminal paper describing the
initial Google implementation, written by the founders:
The Anatomy of a Large-Scale Hypertextual Web Search Engine
Sergey Brin and Lawrence Page
<http://infolab.stanford.edu/~backrub/google.html>
But be warned: indexing is a deep topic, and may become a consuming
passion. Roulette is rumored to have been invented by a monk who came
to believe he could find a way to predict its outcomes, and eventually
went mad trying. Indexing is a bit like that. :)
A couple of the longer-term projects I work on need to incorporate good
indexing of large corpuses, and my own code to that end has advanced
only in small baby steps as I learn more about it.
So while I have little in the way of applied code to share at the
moment, I can offer a few theoretical pointers:
From what I'm reading, my first advice would be to not bother unless
you absolutely need to. If there's any way you can use an existing
index you'll be a happier man to do so.
But if you have to make your own index, you may find it helpful (or
maddening) to consider the challenges involved with the various tenses
and inflections of words, and how to determine the root word in its
native form (the "lemma") for your lookups.
Currently I'm looking into the various lemmatization schemes available,
since it can help tremendously to both keep the index small enough to be
practical while returning more relevant search results.
Lemmatization attempts this through linguistic rules; stemming could be
said to be a form of "cheating" by using simpler algos less dependent on
the nuances of a given language to attempt the same result. The
differences are explained well here:
Stemming and lemmatization
<http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html>
Links to specific stemming algos are at the bottom of this article:
<http://www.comp.lancs.ac.uk/computing/research/stemming/general/>
Without lemmatization or stemming, searches for "children" will fail to
find "child", which could well be relevant to the searcher. And some
stemming methods won't be able to transform "children" to "child" since
it's an uncommon transformation linguistically, much less so than more
common plural forms like just adding "s" or "es" to the end.
There's a wealth of info available searching the web for "indexing
algorithms", "stemming algorithms", etc. Doing it well may take a
lifetime; fortunately it seems there are some "cheating" methods which
will do most of the job well enough in less time.
All that said, I have to wonder: if Wikipedia's content is available,
isn't their index also available?
Porting it from MySQL to SQLite seems a far less daunting task than
writing an index from scratch.
--
Richard Gaskin
Fourth World
Rev training and consulting: http://www.fourthworld.com
Webzine for Rev developers: http://www.revjournal.com
revJournal blog: http://revjournal.com/blog.irv
More information about the use-livecode
mailing list