Words Indexing strategies
Alejandro Tejada
capellan2000 at gmail.com
Wed Feb 10 17:30:01 EST 2010
Many thanks for replying this question. :-)
on Wed, 10 Feb 2010 15:30:23 +0000
Bernard Devlin wrote:
> OK, so that's why you mention the different files for each letter of
> the alphabet.
Yes, each one of these 28 text files will be compressed
in gz format. When users look for a word, or many words,
only these file(s) are decompressed and searched.
> I'm still a bit confused. Normally an index would indicate a location
> for an indexed term That's what I assume your general index files are
> doing. What are the key terms like in this index, and what do they
> point to? Can you give us some examples?
Sure, look the example that Brian Yennie wrote:
monkey:1,34,3827,21314
The numbers (after the word) are the names of compressed gz files.
These files are located automatically or (if not found) manually,
when the program starts.
> To my confused and befuddled mind, it sounds like you are wanting to
> create an index of the index. That can't be right :-)
No, this program uses only three kind of indexes:
1) Articles 2) Articles name's redirections 3) Words
> I'm no expert in search algorithms. I have been hoping someone else
> would jump in who has done this kind of thing before.
> Are you wanting a pure, rev-only solution i.e. are you doing this to
> demonstrate what can be done using Rev alone?
Ideally, this should be a Rev only solution, for cross-platform
porting.
on Wed, 10 Feb 2010 11:17:08 -0500
Brian Yennie wrote:
> The first step for this would likely include creating an "inverted" index.
> This means you store something like: monkey:1,34,3827,2131
> Where the word being indexed in "monkey" and the numbers
> that follow are article IDs.
> Using this information it is pretty trivial to implement AND / OR.
> Just merge the article IDs as needed. You can use a mix of this inverted
> index and your regular index in combination to do other types of queries.
> You can even store word frequency information to find similar articles.
Yes, this is correct and should work fine, but how could i write in the
word index a range of article where a word appears consecutively:
baboon:1934,2345,2346,2347,2348,2349,2350,2351,2352,2567,3578
> With that said, I think you will have a very hard time crafting a competitive
> algorithm for this in Rev. A lot of the engines out there are very mature and
> there are many open source ones that you might consider bundling with
> your project instead of writing it yourself.
Actually, i look for good performance and cross-platform portability.
So Rev is my first choice.
> Some things to watch out for:
> 1) File size. Indexing every word takes a lot of space, especially if you indices
> are in plain text and not a compact binary format
How could i convert this index format in a compact binary format?
baboon:1934,2345,2346,2347,2348,2349,2350,2351,2352,2567,3578
monkey:1,34,3827,2131, 3456,4567,5678,5789,6123,6234,6456
> 2) Common words. You might consider a "stop word" list, or a threshold.
> For example, if a word is in > 20% of all articles, don't index it.
> Or if it's in your "stop word" list.
Previously i believed that stop words should appear in all articles.
> 3) Root words / plurals. Can you detect that "monkeys"
> is the plural of "monkey" (or more complex cases)?
Richard wrote about a similar concern in his answer.
I suppose that this feature is useful to recommend similar
terms, when users start a new search.
> An example of an open source engine that is very mature
> is Lucene, which can be run from the command-line in Java.
How could i run Java applications from Runrev,
without asking users to install Java first?
on Wed, 10 Feb 2010 08:26:30 -0800
Richard Gaskin wrote:
> The ambitious Alejandro Tejada wrote:
this is only a really modest search
:-D
> A good place to start on that is the seminal paper describing the
> initial Google implementation, written by the founders:
>
> The Anatomy of a Large-Scale Hypertextual Web Search Engine
> Sergey Brin and Lawrence Page
> <http://infolab.stanford.edu/~backrub/google.html>
>
> But be warned: indexing is a deep topic, and may become a consuming
> passion. Roulette is rumored to have been invented by a monk who came
> to believe he could find a way to predict its outcomes, and eventually
> went mad trying. Indexing is a bit like that. :)
Many thanks for pointers in this direction.
> A couple of the longer-term projects I work on need to incorporate good
> indexing of large corpuses, and my own code to that end has advanced
> only in small baby steps as I learn more about it.
[snip of really useful information]
> All that said, I have to wonder: if Wikipedia's content is available,
> isn't their index also available?
An article index is available, but a word index is not.
> Porting it from MySQL to SQLite seems a far less daunting task than
> writing an index from scratch.
SQLite is way over my head in this moment. :-)
Thanks again for answering this request!
Alejandro
More information about the use-livecode
mailing list