Words Indexing strategies

Alejandro Tejada capellan2000 at gmail.com
Wed Feb 10 17:30:01 EST 2010


Many thanks for replying this question. :-)

on Wed, 10 Feb 2010 15:30:23 +0000
Bernard Devlin wrote:

> OK, so that's why you mention the different files for each letter of
> the alphabet.

Yes, each one of these 28 text files will be compressed
in gz format. When users look for a word, or many words,
only these file(s) are decompressed and searched.

> I'm still a bit confused.  Normally an index would indicate a location
> for an indexed term  That's what I assume your general index files are
> doing.  What are the key terms like in this index, and what do they
> point to?  Can you give us some examples?

Sure, look the example that Brian Yennie wrote:
monkey:1,34,3827,21314

The numbers (after the word) are the names of compressed gz files.
These files are located automatically or (if not found) manually,
when the program starts.

> To my confused and befuddled mind, it sounds like you are wanting to
> create an index of the index.  That can't be right :-)

No, this program uses only three kind of indexes:
1) Articles 2) Articles name's redirections 3) Words

> I'm no expert in search algorithms.  I have been hoping someone else
> would jump in who has done this kind of thing before.
> Are you wanting a pure, rev-only solution i.e. are you doing this to
> demonstrate what can be done using Rev alone?

Ideally, this should be a Rev only solution, for cross-platform
porting.

on Wed, 10 Feb 2010 11:17:08 -0500
Brian Yennie wrote:

> The first step for this would likely include creating an "inverted" index.
> This means you store something like: monkey:1,34,3827,2131
> Where the word being indexed in "monkey" and the numbers
> that follow are article IDs.
> Using this information it is pretty trivial to implement AND / OR.
> Just merge the article IDs as needed. You can use a mix of this inverted
> index and your regular index in combination to do other types of queries.
> You can even store word frequency information to find similar articles.

Yes, this is correct and should work fine, but how could i write in the
word index a range of article where a word appears consecutively:
baboon:1934,2345,2346,2347,2348,2349,2350,2351,2352,2567,3578

> With that said, I think you will have a very hard time crafting a competitive
> algorithm for this in Rev. A lot of the engines out there are very mature and
> there are many open source ones that you might consider bundling with
> your project instead of writing it yourself.

Actually, i look for good performance and cross-platform portability.
So Rev is my first choice.

> Some things to watch out for:
> 1) File size. Indexing every word takes a lot of space, especially if you indices
> are in plain text and not a compact binary format

How could i convert this index format in a compact binary format?

baboon:1934,2345,2346,2347,2348,2349,2350,2351,2352,2567,3578
monkey:1,34,3827,2131, 3456,4567,5678,5789,6123,6234,6456

> 2) Common words. You might consider a "stop word" list, or a threshold.
> For example, if a word is in > 20% of all articles, don't index it.
> Or if it's in your "stop word" list.

Previously i believed that stop words should appear in all articles.

> 3) Root words / plurals. Can you detect that "monkeys"
> is the plural of "monkey" (or more complex cases)?

Richard wrote about a similar concern in his answer.
I suppose that this feature is useful to recommend similar
terms, when users start a new search.

> An example of an open source engine that is very mature
> is Lucene, which can be run from the command-line in Java.

How could i run Java applications from Runrev,
without asking users to install Java first?

on Wed, 10 Feb 2010 08:26:30 -0800
Richard Gaskin wrote:

> The ambitious Alejandro Tejada wrote:

this is only a really modest search
:-D

> A good place to start on that is the seminal paper describing the
> initial Google implementation, written by the founders:
>
> The Anatomy of a Large-Scale Hypertextual Web Search Engine
> Sergey Brin and Lawrence Page
> <http://infolab.stanford.edu/~backrub/google.html>
>
> But be warned:  indexing is a deep topic, and may become a consuming
> passion.  Roulette is rumored to have been invented by a monk who came
> to believe he could find a way to predict its outcomes, and eventually
> went mad trying.  Indexing is a bit like that. :)

Many thanks for pointers in this direction.

> A couple of the longer-term projects I work on need to incorporate good
> indexing of large corpuses, and my own code to that end has advanced
> only in small baby steps as I learn more about it.

[snip of really useful information]


> All that said, I have to wonder: if Wikipedia's content is available,
> isn't their index also available?

An article index is available, but a word index is not.

> Porting it from MySQL to SQLite seems a far less daunting task than
> writing an index from scratch.

SQLite is way over my head in this moment. :-)

Thanks again for answering this request!

Alejandro



More information about the use-livecode mailing list