Indexing mail list messages
Brian Yennie
briany at qldlearning.com
Thu Jul 21 19:08:51 EDT 2005
Alejandro,
Some off-the-cuff thoughts:
* Add synonyms for common xTalk terms (cd => card, btn => button, etc)
and combine their indices
* Support some sort of stemming (or at least, combine words with their
plurals)
* Create a stop word threshold: any term which occurs in more than X%
of messages becomes a stop word and is discarded from the index.
* Index by message, not by line. You could always find the line in the
message on the fly.
* Don't index all message headers
* Don't index message footers and/or signatures
* Remove duplicates (i.e. if a word appears twice on a line or twice in a
message, index it only once)
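To make a few of these concrete, here is a rough sketch in Python (not xTalk, purely for brevity) that combines the synonym, plural, per-message, duplicate and stop word ideas. The synonym table, the crude plural rule, the function names and the 50% default threshold are all my own assumptions for illustration, not anything your index has to use.

```python
import re
from collections import defaultdict

# Hypothetical synonym table: abbreviation -> canonical term.
SYNONYMS = {"cd": "card", "btn": "button", "fld": "field"}

def normalize(word):
    """Lowercase, fold simple plurals, then map synonyms to one term."""
    word = word.lower()
    # Crude plural folding, not a real stemmer: "buttons" -> "button".
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    return SYNONYMS.get(word, word)

def build_index(messages, max_fraction=0.5):
    """Map each term to the set of message numbers that contain it.

    Terms found in more than max_fraction of the messages are treated
    as stop words and left out of the result.
    """
    index = defaultdict(set)
    for msg_id, body in enumerate(messages):
        for word in re.findall(r"[A-Za-z]+", body):
            # A set per term means repeats within a message cost nothing.
            index[normalize(word)].add(msg_id)
    limit = max_fraction * len(messages)
    return {term: ids for term, ids in index.items() if len(ids) <= limit}
```

Since the index only records message numbers, finding the matching line means scanning that one message at search time, which should be cheap.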
Hope these give you some ideas.
Of course I also have a high-level question: what's wrong with just a
5MB index on a CD-ROM? If the concern is just disk space, you could
compress the index and probably get significant savings.
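As a quick illustration of the compression idea, again in Python and again only a sketch: the JSON layout and the choice of zlib are my assumptions, and the actual savings will depend on what your index looks like.

```python
import json
import zlib

def pack_index(index):
    """Serialize a term -> message-numbers mapping and compress it."""
    plain = json.dumps({t: sorted(ids) for t, ids in index.items()},
                       sort_keys=True)
    return zlib.compress(plain.encode("utf-8"), 9)  # 9 = best compression

def unpack_index(blob):
    """Reverse pack_index: decompress and rebuild the sets."""
    data = json.loads(zlib.decompress(blob).decode("utf-8"))
    return {t: set(ids) for t, ids in data.items()}
```

You would pack once when building the CD and unpack once at startup, so the cost at search time is nil.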
- Brian