Indexing mail list messages
Alex Tweedly
alex at tweedly.net
Thu Jul 21 18:57:28 EDT 2005
Alejandro Tejada wrote:
>The second index is for keywords within each
>text file, using the same approach.
>Unfortunately, using this approach, pairing
>words with line offsets created in some cases
>files bigger than the mail archive! :-(
>For example, the June 2005 text file is only
>4.8 MB, but the index is more than 5.3 MB...
>
>Afterwards, I deleted the stop words from the index,
>(search in Google for: "google stop words")
>it was "reduced" to 3.5 MB. Still too big for
>my taste.
>
>Which approach could I take to build a smaller
>but still accurate word index for mail list archives?
>
>
Are you indexing every line where the word occurs?
Could you instead index only the message number (or id, or first line of
the message)?
Or could you post the code / a stack, to save me asking you another 50
questions ...? :-)
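
To make the message-number idea concrete, here is a rough sketch in Python rather than a stack. It is only an illustration, not your code: the function names, the tiny stop-word list and the sample archive are all made up, and it assumes messages are separated by lines beginning with "From ". Each word is stored once per message, however many lines it appears on.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for"}

def split_messages(archive_text):
    """Split an mbox-style archive on lines starting with 'From '.

    Assumption: body lines never start with 'From ' (real mbox files
    usually escape them, but this sketch does not check).
    """
    messages, current = [], []
    for line in archive_text.splitlines():
        if line.startswith("From ") and current:
            messages.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        messages.append("\n".join(current))
    return messages

def build_index(messages):
    """Map each word to the sorted list of message numbers containing it."""
    index = defaultdict(set)
    for number, message in enumerate(messages, start=1):
        for word in re.findall(r"[a-z0-9']+", message.lower()):
            if word not in STOP_WORDS:
                index[word].add(number)  # a set keeps one entry per message
    return {word: sorted(numbers) for word, numbers in index.items()}

sample = (
    "From alice Thu Jul 21 2005\n"
    "Subject: index size\n"
    "\n"
    "The index is too big. The index grows.\n"
    "From bob Thu Jul 21 2005\n"
    "Subject: re: index size\n"
    "\n"
    "Store the message number only.\n"
)
index = build_index(split_messages(sample))
print(index["index"])   # [1, 2]
print(index["number"])  # [2]
```

A word repeated on many lines of one message costs a single entry here, which is where the space saving would come from. The price is that a search then has to scan the matching message to find the exact line.
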
Are you keeping the whole mbox format, or discarding the headers you
don't need?
How many different words remain after the stop words are discarded?
How many lines are in the file? How many entries per word (min, max,
mean, std dev)?
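
In case it helps with gathering those numbers, here is a small Python sketch of the kind of summary I mean. It assumes the index is already in memory as a mapping from each word to its list of entries. The name posting_stats and the toy data are invented for the example, and the standard deviation is the population one.

```python
import statistics

def posting_stats(index):
    """Summarise how many entries each word has in a word -> entries index."""
    counts = [len(entries) for entries in index.values()]
    return {
        "distinct_words": len(counts),
        "total_entries": sum(counts),
        "min": min(counts),
        "max": max(counts),
        "mean": statistics.mean(counts),
        "std_dev": statistics.pstdev(counts),  # population standard deviation
    }

# Toy index, purely illustrative: word -> list of line or message numbers.
toy_index = {"index": [1, 2, 5], "mbox": [2], "stack": [3, 4]}
stats = posting_stats(toy_index)
for name, value in stats.items():
    print(name, value)
```

If a handful of words account for most of the entries, those are the ones worth treating specially (or adding to the stop list).
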
--
Alex Tweedly http://www.tweedly.net