Indexing mail list messages

Brian Yennie briany at qldlearning.com
Thu Jul 21 19:08:51 EDT 2005


Alejandro,

Some off-the-cuff thoughts:

* Add synonyms for common xTalk terms (cd => card, btn => button, etc.) 
and combine their indices
* Support some sort of stemming (or at least, combine words with their 
plurals)
* Create a stop word threshold: any term which occurs in more than X% 
of messages becomes a stop word and is discarded from the index.
* Index by message, not by line. You could always find the line in the 
message on the fly.
* Don't index all message headers
* Don't index message footers and/or signatures
* Remove duplicates (i.e. if a word appears twice on a line or twice in 
a message, index it only once)
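
As a rough illustration only, several of the points above (synonyms, plural 
folding, a stop word threshold, per-message indexing and duplicate removal) 
could be sketched in Python like this. The synonym table, the 50% threshold 
and the function names are assumptions made for the example, and the plural 
folding is deliberately crude rather than real stemming. Header and 
signature stripping are not covered here.

```python
import re
from collections import defaultdict

# Hypothetical synonym table: xTalk abbreviation -> full term.
SYNONYMS = {"cd": "card", "btn": "button", "fld": "field"}

WORD = re.compile(r"[a-z0-9]+")


def normalize(word):
    """Lower-case a word, fold a trailing plural 's', then map synonyms."""
    word = word.lower()
    # Crude plural folding (an assumption, not a real stemmer).
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    return SYNONYMS.get(word, word)


def build_index(messages, stop_threshold=0.5):
    """Map each term to the sorted ids of the messages that contain it.

    Terms found in more than stop_threshold of all messages are treated
    as stop words and left out of the index.
    """
    postings = defaultdict(set)  # a set stores each message id only once
    for msg_id, body in enumerate(messages):
        for word in WORD.findall(body.lower()):
            postings[normalize(word)].add(msg_id)
    limit = stop_threshold * len(messages)
    return {term: sorted(ids) for term, ids in postings.items()
            if len(ids) <= limit}


def matching_lines(body, term):
    """Find the 1-based line numbers of a term in one message, on the fly."""
    return [n for n, line in enumerate(body.splitlines(), 1)
            if term in {normalize(w) for w in WORD.findall(line.lower())}]
```

Because the index stores message ids rather than line numbers, a lookup 
such as `build_index(messages)["card"]` returns the messages to open, and 
`matching_lines` then locates the line inside each one.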

Hope these give you some ideas.

Of course, I also have a high-level question: what's wrong with just a 
5MB index on a CD-ROM? If the concern is just disk space, you could 
compress the index and probably get significant savings.
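
For what it's worth, compressing an index could look something like the 
following Python sketch. It assumes the index is a plain dictionary of 
terms to message ids and that JSON plus zlib is an acceptable storage 
format; both are choices made for the example, and the actual savings 
would depend on the data.

```python
import json
import zlib


def pack_index(index):
    """Serialize the index to JSON and compress it at the highest level."""
    raw = json.dumps(index, sort_keys=True).encode("utf-8")
    return zlib.compress(raw, 9)


def unpack_index(blob):
    """Reverse pack_index: decompress and parse back into a dictionary."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))
```

Comparing `len(pack_index(index))` with the size of the uncompressed JSON 
would show how much space is actually saved for a given index.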

- Brian



