Words Indexing strategies

Alejandro Tejada capellan2000 at gmail.com
Sat Feb 13 11:23:28 EST 2010


Hi Bernard,


Bernard Devlin-2 wrote:
> 
> [snip]
> Am I right that given these search terms: baboon OR monkey AND fruit
> and
> index file b.tgz contains a line like this: baboon: 1,5,9
> index file m.tgz contains a line like this: monkey: 2,7,17
> index file f.tgz contains a line like this: fruit: 3,7,23
> you would want the result of your search to be: 7 i.e. the number of
> the article that matches the boolean search?  Unless I've
> misunderstood, what you want to do is combine indexes in order to
> satisfy boolean combinations of search terms.
> 

Yes, this is correct.
With larger sets of data, it should be easier to convert
these results into arrays and merge them, following
the explanations provided by Ken Ray on his website:
http://www.sonsothunder.com/devres/revolution/tips/arry002.htm
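
As a rough sketch of that merge (written in Python rather than Rev, only to
show the idea): each index line becomes a set of article numbers, OR is a
union and AND is an intersection. The operators are assumed here to apply
strictly left to right, because that is the reading that yields 7 for
Bernard's example; with the conventional precedence (AND before OR) the
result would be 1,5,7,9 instead. The names parse_index_line and search are
made up for this illustration.

```python
def parse_index_line(line):
    """Turn 'baboon: 1,5,9' into ('baboon', {1, 5, 9})."""
    word, _, numbers = line.partition(":")
    return word.strip(), {int(n) for n in numbers.split(",") if n.strip()}


def search(query, index):
    """Evaluate 'a OR b AND c' strictly left to right (an assumption)."""
    tokens = query.split()
    result = set(index.get(tokens[0].lower(), set()))
    # The remaining tokens come in (operator, word) pairs.
    for op, word in zip(tokens[1::2], tokens[2::2]):
        postings = index.get(word.lower(), set())
        if op.upper() == "AND":
            result &= postings   # keep articles present in both
        elif op.upper() == "OR":
            result |= postings   # keep articles present in either
        else:
            raise ValueError("unknown operator: " + op)
    return sorted(result)


index = dict(parse_index_line(l) for l in [
    "baboon: 1,5,9",
    "monkey: 2,7,17",
    "fruit: 3,7,23",
])
print(search("baboon OR monkey AND fruit", index))  # [7]
```

In Rev the same thing could presumably be done with array keys in place of
sets, along the lines of Ken's tip.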

You also made a valid point in the following comment:


Bernard Devlin-2 wrote:
> 
> However, it looks to me like the existing indexes don't contain enough
> information for you to calculate frequency of occurrence (a measure of
> relevance).  And depending on how these pre-existing indexes have been
> constructed they may not have any stemming information in them.  You
> might be able to build some kind of rough stemming algorithm in Rev
> (by doing rough pluralization like 'baboon*', but as Richard pointed
> out more complex plurals like 'children' will be where the work
> comes).
> 
> Are you looking for an approximate solution?  Or do you need greater
> flexibility of scope and relevance scores, etc. ?
> 

To start this project, an approximate solution would be fine;
I can keep refining the search algorithm over time.

After taking a close look at the 580,000-article index (and the
450,000-redirection index),
I understand that using a stemming algorithm effectively could save
many megabytes.

You are right about the lack of information in the current simple index format.
Relevance should be a function of the number of times a word appears
in an article. An index line that includes this information could look like:

monkey:3827#12|15,1#4|3,2131#18|3,34#3|2,4567#2|2,3456#22|1

In this new example, each entry has the form datafile#article|count.
The first entry means that in compressed datafile number 3827, article #12,
there are 15 instances of the word "monkey"...

Surely there is some better notation to handle this data;
feel free to send your ideas and comments.
Again, converting this data into an array makes it easier to work with.
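
To make the notation concrete, here is a minimal sketch (in Python, not Rev)
of splitting such a line into records and sorting them by count as a crude
relevance score. It assumes exactly the datafile#article|count layout of the
example above; parse_postings and the field names are hypothetical.

```python
def parse_postings(line):
    """Parse 'word:datafile#article|count,...' into (word, list of dicts)."""
    word, _, body = line.partition(":")
    postings = []
    for entry in body.split(","):
        location, _, count = entry.partition("|")
        datafile, _, article = location.partition("#")
        postings.append({
            "datafile": int(datafile),
            "article": int(article),
            "count": int(count),
        })
    return word.strip(), postings


line = "monkey:3827#12|15,1#4|3,2131#18|3,34#3|2,4567#2|2,3456#22|1"
word, postings = parse_postings(line)

# Rank by occurrence count, highest first, as a crude relevance score.
ranked = sorted(postings, key=lambda p: p["count"], reverse=True)
print(word, ranked[0])
```

Sorting by raw count is only a starting point; it ignores article length,
so long articles would tend to rank higher.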


Richard Gaskin wrote:
> 
> Once again, MetaCard to the rescue! :)
> 
> Raney included this little gem in MC's Examples stack, and using "repeat
> for each" and arrays it's blazing fast, able to make a frequency table
> for even large files in almost no time at all:
> 

Yes, this works great! Many thanks for digging up this handler. :-)
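
For anyone following along without the Examples stack, the general idea of a
frequency table can be sketched like this in Python. This is not Raney's
handler, only an illustration of the same technique; the word-splitting rule,
the sample text and the helper names are all assumptions, and the entry
format is the datafile#article|count notation proposed above.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each lowercase word occurs in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


def index_entry(datafile, article, count):
    """Format one posting in the datafile#article|count notation."""
    return "%d#%d|%d" % (datafile, article, count)


text = "The monkey ate fruit. Another monkey watched the first monkey."
freq = word_frequencies(text)
print(freq["monkey"])                          # 3
print(index_entry(3827, 12, freq["monkey"]))   # 3827#12|3
```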

Have a nice weekend!

Alejandro


