Indexing mail list messages

Alejandro Tejada capellan2000 at yahoo.com
Thu Jul 21 21:51:03 EDT 2005


Hi Brian, :-)

Brian Yennie wrote:

> Some off-the-cuff thoughts:
> * Add synonyms for common xTalk terms (cd => card, 
> btn => button, etc) and combine their indices

Interesting idea, I'll give more thought
to this possibility.
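
A minimal sketch of the synonym idea, in Python rather than xTalk. The table holds only the two abbreviations from the quoted example, and the names SYNONYMS, canonical and add_terms are hypothetical. Each word is mapped to a canonical form before it is stored, so "btn" and "button" share one index entry.

```python
# Hypothetical synonym table; only the abbreviations quoted above are listed.
SYNONYMS = {"cd": "card", "btn": "button"}

def canonical(term):
    """Map an abbreviation to its full form so both share one index entry."""
    term = term.lower()
    return SYNONYMS.get(term, term)

def add_terms(index, message_id, words):
    """Record message_id under the canonical form of each word."""
    for word in words:
        index.setdefault(canonical(word), set()).add(message_id)
    return index

index = {}
add_terms(index, 1, ["btn", "script"])
add_terms(index, 2, ["button", "cd"])
print(sorted(index["button"]))
```

The same canonical() call would have to be applied to the search term at query time, otherwise a search for "btn" would find nothing.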

> * Support some sort of stemming (or at least, combine
> words with their plurals)

Yes, this is a must.
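
As a rough illustration of the "at least combine plurals" fallback, here is a naive suffix-stripping function in Python. It is not a real stemmer (nothing like Porter's algorithm), the rules are assumptions that cover only regular English plurals, and fold_plural is a hypothetical name.

```python
def fold_plural(word):
    """Reduce a regular English plural to its singular form (naive rules)."""
    word = word.lower()
    # queries -> query
    if len(word) > 4 and word.endswith("ies"):
        return word[:-3] + "y"
    # boxes -> box, glasses -> glass
    if len(word) > 3 and word.endswith(("ches", "shes", "xes", "sses")):
        return word[:-2]
    # buttons -> button, but leave words like "class", "status", "this" alone
    if len(word) > 3 and word.endswith("s") and not word.endswith(("ss", "us", "is")):
        return word[:-1]
    return word

for w in ["buttons", "queries", "boxes", "class", "status"]:
    print(w, "->", fold_plural(w))
```

Irregular plurals (for example "indices") are not handled and would need a lookup table or a proper stemming library.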

> * Create a stop word threshold: any term which occurs
> in more than X% of messages becomes a stop word and
> is discarded from the index.

This is a good recommendation. For example,
the word "revolution" should be a stop word. :-)
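
The threshold rule can be sketched like this in Python. The sample messages and the 50% cutoff are made up for illustration, and find_stop_words is a hypothetical name. Each term is counted once per message, and any term found in more than the given fraction of messages is reported as a stop word.

```python
def find_stop_words(messages, threshold=0.5):
    """Return terms that occur in more than `threshold` of all messages.

    messages: a list of word lists, one list per message.
    """
    doc_freq = {}
    for words in messages:
        # Count each term once per message, however often it repeats.
        for term in set(w.lower() for w in words):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    limit = threshold * len(messages)
    return {term for term, count in doc_freq.items() if count > limit}

# Made-up sample data: "revolution" appears in every message.
messages = [
    ["revolution", "stack", "script"],
    ["revolution", "button"],
    ["revolution", "card", "stack"],
    ["revolution", "index"],
]
stop_words = find_stop_words(messages, threshold=0.5)
print(stop_words)
```

With this data only "revolution" crosses the cutoff; "stack" appears in exactly half of the messages, which is not more than 50%.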

> * Index by message, not by line. You could always 
> find the line in the message on the fly.

Yes, Alex Tweedley makes this recommendation too.
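
One way to picture indexing by message, sketched in Python under the assumption that each message has a numeric id and plain-text body (tokenize, build_index and find_lines are hypothetical names). The index stores only message ids per term; the matching lines are located at search time by rescanning the few messages that matched.

```python
import re

def tokenize(text):
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(messages):
    """messages: dict of id -> text. Returns a dict of term -> set of ids."""
    index = {}
    for msg_id, text in messages.items():
        for term in tokenize(text):
            index.setdefault(term, set()).add(msg_id)
    return index

def find_lines(text, term):
    """Return the 1-based numbers of the lines that contain the term."""
    return [n for n, line in enumerate(text.splitlines(), start=1)
            if term in tokenize(line)]

# Made-up sample messages.
messages = {
    1: "How do I script a button?\nThe button is on card 2.",
    2: "Stacks and cards\nNo mention here.",
}
index = build_index(messages)
for msg_id in sorted(index.get("button", set())):
    print(msg_id, find_lines(messages[msg_id], "button"))
```

The trade-off is a smaller index in exchange for a little extra work per hit, which is usually cheap because only matching messages are rescanned.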

> * Don't index all message headers
> * Don't index message footers and/or signatures

The headers contain some useful info... No?
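
A possible middle ground, sketched with Python's standard email module: keep a short list of headers and drop the rest, then cut the body at the signature. Which headers count as useful (KEEP_HEADERS) is an assumption, as is treating a "-- " line or a row of underscores as the start of a signature or footer. The function name indexable_text is hypothetical.

```python
from email import message_from_string

# Assumed choice of headers worth indexing; everything else is skipped.
KEEP_HEADERS = ("From", "Subject", "Date")

def indexable_text(raw):
    """Return the kept header values plus the body up to the signature."""
    msg = message_from_string(raw)
    kept = [msg[name] for name in KEEP_HEADERS if msg[name]]
    body_lines = []
    for line in msg.get_payload().splitlines():
        # Assumed markers: "-- " starts a signature, underscores a list footer.
        if line == "-- " or line.startswith("_____"):
            break
        body_lines.append(line)
    return "\n".join(kept + body_lines)

# Made-up single-part message for illustration.
sample = (
    "From: al@example.com\n"
    "Subject: Indexing\n"
    "X-Mailer: Foo\n"
    "\n"
    "Body line\n"
    "-- \n"
    "sig\n"
)
text = indexable_text(sample)
print(text)
```

This sketch handles only single-part plain-text messages; multipart mail would need its parts walked separately.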

> * Remove dups (i.e. if a word appears twice on a line
> or twice in a message)

Yes, this is a must too.
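
Removing duplicates before indexing can be as small as this Python sketch (unique_terms is a hypothetical name). It lowercases the words of one message and keeps each term once, in order of first appearance, so the index gets one entry per term per message.

```python
def unique_terms(words):
    """Return each term once, in order of first appearance."""
    return list(dict.fromkeys(w.lower() for w in words))

# Made-up word list from one message.
words = ["Button", "script", "button", "card", "script"]
terms = unique_terms(words)
print(terms)
```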

> Hope these give you some ideas.

Sure they do! These are mind-opening
ideas. You can be sure that many other
ideas, probably unrelated to this task,
will come to life while working on this... :-)

Today I stumbled on an interesting idea for
a new educational game. Let's hope we can
raise the resources to make this game a reality!

> Of course I also have a high level question- what's 
> wrong with just a 5MB index on a CD-ROM? If it is 
> just for disk space, you could compress 
> the index and probably get a significant savings.

Space is not the problem; fast searching in optimized
indexes is the goal. ;-)
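
The message does not say what form the optimized index takes, so the following Python sketch shows just one possible reading: keep the terms sorted and binary-search them, so a lookup costs a handful of comparisons even for a large vocabulary. The parallel-list layout, the sample data and the name lookup are all assumptions.

```python
import bisect

def lookup(sorted_terms, postings, term):
    """Binary-search the sorted term list and return the matching message ids."""
    i = bisect.bisect_left(sorted_terms, term)
    if i < len(sorted_terms) and sorted_terms[i] == term:
        return postings[i]
    return []

# Made-up index: postings[i] lists the message ids for sorted_terms[i].
sorted_terms = ["button", "card", "script", "stack"]
postings = [[1, 4], [2], [1, 3], [2, 3]]

print(lookup(sorted_terms, postings, "script"))
print(lookup(sorted_terms, postings, "field"))
```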

Thanks again for your help, Brian!

al

Visit my site:
http://www.geocities.com/capellan2000/


		
 



More information about the use-livecode mailing list