Words Indexing strategies

Brian Yennie briany at qldlearning.com
Wed Feb 10 17:48:49 EST 2010


> Yes, this is correct and should work fine, but how could i write in the
> word index a range of article where a word appears consecutively:
> baboon:1934,2345,2346,2347,2348,2349,2350,2351,2352,2567,3578

If this were your format, you could compact to something like:
baboon:1934,2345-2352,2567,3578
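
A minimal sketch of that compaction, in Python rather than Rev just to show the idea. The function names compact_ranges and expand_ranges are made up for illustration, and it assumes the article numbers are plain positive integers:

```python
def compact_ranges(ids):
    """Collapse runs of consecutive article numbers into 'start-end' ranges."""
    ids = sorted(set(ids))          # ranges only make sense on sorted, unique ids
    parts = []
    i = 0
    while i < len(ids):
        start = end = ids[i]
        # Extend the run while the next id is exactly one more than the last.
        while i + 1 < len(ids) and ids[i + 1] == end + 1:
            i += 1
            end = ids[i]
        parts.append(str(start) if start == end else f"{start}-{end}")
        i += 1
    return ",".join(parts)


def expand_ranges(text):
    """Inverse of compact_ranges: turn '1934,2345-2352' back into a list."""
    ids = []
    for part in text.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            ids.extend(range(int(lo), int(hi) + 1))
        else:
            ids.append(int(part))
    return ids


baboon = [1934, 2345, 2346, 2347, 2348, 2349, 2350, 2351, 2352, 2567, 3578]
print("baboon:" + compact_ranges(baboon))
```

The saving depends entirely on how often a word shows up in consecutive articles; for a word scattered across the collection the output is the same as the input.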

> How could i convert this index format in a compact binary format?
> 
> baboon:1934,2345,2346,2347,2348,2349,2350,2351,2352,2567,3578
> monkey:1,34,3827,2131, 3456,4567,5678,5789,6123,6234,6456

There are a lot of possibilities, most of them probably way beyond the scope of this discussion. For starters, you could convert each number from text to binary. You could also go for a B-tree structure, but that is going to be awfully difficult in Rev.
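
To make the text-to-binary idea concrete, here is a rough sketch in Python (not Rev). The layout is an assumption of mine, not a standard format: a 4-byte count followed by each article number as a 4-byte unsigned big-endian integer. The names pack_postings and unpack_postings are hypothetical.

```python
import struct


def pack_postings(ids):
    """Pack article numbers as: count, then each id, all unsigned 32-bit big-endian."""
    return struct.pack(f">I{len(ids)}I", len(ids), *ids)


def unpack_postings(data):
    """Read the count from the first 4 bytes, then that many 32-bit ids."""
    (count,) = struct.unpack_from(">I", data, 0)
    return list(struct.unpack_from(f">{count}I", data, 4))


baboon = [1934, 2345, 2346, 2347, 2348, 2349, 2350, 2351, 2352, 2567, 3578]
blob = pack_postings(baboon)
print(len(blob), "bytes")           # 4 bytes for the count + 4 bytes per id
print(unpack_postings(blob) == baboon)
```

Fixed-width integers make it easy to jump to the n-th entry without parsing, which is the main attraction over comma-separated text. Whether the file actually gets smaller depends on how many digits your article numbers have.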

> Previously i believed that stop words should appear in all articles.

I would go with a threshold for sure. Think about what it means for index size if a word is in 50% of all articles or more. And why would you want to search for that word anyway?
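
A threshold like that could be applied roughly as follows. This is a sketch in Python, assuming the index is held as a mapping from each word to the list of article numbers it appears in; find_stop_words, prune_index and the 0.5 default are illustrative choices, not anything built in.

```python
def find_stop_words(index, total_articles, threshold=0.5):
    """Return the words that appear in at least `threshold` of all articles."""
    return {
        word
        for word, ids in index.items()
        if len(set(ids)) / total_articles >= threshold
    }


def prune_index(index, total_articles, threshold=0.5):
    """Return a copy of the index with the stop words left out."""
    stop = find_stop_words(index, total_articles, threshold)
    return {word: ids for word, ids in index.items() if word not in stop}


# Tiny made-up index over 4 articles.
index = {"the": [1, 2, 3, 4], "monkey": [1, 3], "baboon": [2]}
print(sorted(find_stop_words(index, total_articles=4)))
```

With only 4 articles the cutoff is very coarse; on a real collection you would tune the threshold by looking at which words it actually removes.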

> Richard wrote about a similar concern in his answer.
> I suppose that this feature is useful to recommend similar
> terms, when users start a new search.

Yes, but it's also "built-in" to the results in most modern search engines. It will help you return better results. Think of the simple case where someone searches for "monkeys" but doesn't find an article named "Monkey". Strictly speaking these are not the same word, but your users can easily be frustrated when the search misses the obvious match.
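
For the "monkeys" versus "Monkey" case, even a very crude normalization step applied to both the indexed words and the search terms goes a long way. The sketch below is in Python and is deliberately naive: it lowercases and strips a single trailing "s". It is not a real stemmer, and the function name normalize is just for illustration.

```python
def normalize(word):
    """Lowercase a word and strip a simple plural 's' (crude, English only)."""
    w = word.strip().lower()
    # Leave short words and words ending in 'ss' (class, glass) alone.
    if len(w) > 3 and w.endswith("s") and not w.endswith("ss"):
        w = w[:-1]
    return w


print(normalize("monkeys") == normalize("Monkey"))
```

A rule this simple will get some words wrong, so treat it as a starting point; the mature search engines use proper stemming algorithms for this.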

> How could i run Java applications from Runrev,
> without asking users to install Java first?

You would have to find a way to bundle it with your app. The upside is, this would be much easier than trying to write something equivalent in Rev. You may very well be able to craft something that meets your needs, but in terms of performance and accuracy you'll have a nearly impossible time matching some of the more mature search engines out there.



