Indexing mail list messages

Alex Tweedly alex at tweedly.net
Thu Jul 21 18:57:28 EDT 2005


Alejandro Tejada wrote:

>The second index is for keywords within each
>text file, using the same approach.
>Unfortunately, pairing words with line offsets
>this way created, in some cases, index files
>bigger than the mail archive! :-(
>For example, the June 2005 text file is only
>4.8 MB, but the index is more than 5.3 MB...
>
>After I deleted the stop words from the index
>(search in Google for: "google stop words"),
>it was "reduced" to 3.5 MB. Still too big for
>my taste.
>
>Which approach could I take to build a smaller
>but still accurate word index for mail list archives?
>  
>
Are you indexing every line where the word occurs?
Could you instead index only the message number (or ID, or first line of
the message)?
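
To make the message-level suggestion concrete, here is a minimal sketch in Python (not the language used on this list, and not the original poster's code). It assumes the archive is mbox-style text in which every message starts with a line beginning with "From ". The names `STOP_WORDS`, `split_messages` and `build_message_index` are hypothetical, and the stop-word set is only a tiny illustrative sample.

```python
import re
from collections import defaultdict

# Illustrative sample only; a real stop-word list would be much longer.
STOP_WORDS = {"the", "a", "and", "of", "to", "in", "is"}


def split_messages(archive_text):
    """Split mbox-style text into messages.

    Assumption: a line starting with 'From ' opens a new message.
    """
    messages, current = [], []
    for line in archive_text.splitlines():
        if line.startswith("From ") and current:
            messages.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        messages.append("\n".join(current))
    return messages


def build_message_index(archive_text):
    """Map each non-stop word to the sorted list of message numbers
    (1-based) that contain it, rather than to every line offset."""
    index = defaultdict(set)
    for number, message in enumerate(split_messages(archive_text), start=1):
        for word in re.findall(r"[a-z0-9']+", message.lower()):
            if word not in STOP_WORDS:
                index[word].add(number)
    return {word: sorted(numbers) for word, numbers in index.items()}
```

Because a word repeated many times in one message is stored only once for that message, an index of this shape should generally have fewer entries than a per-line index. How much smaller it is in practice depends on the archive.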

Or could you post the code or a stack, to save me asking you another 50
questions? :-)

Are you keeping the whole mbox format? Or discarding the headers you
don't need?
How many different words remain after the stop words are discarded?
How many lines are in the file? How many entries are there per word (min,
max, mean, std dev)?
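
The per-word figures asked about here could be gathered with something like the following Python sketch. It assumes the index is already in memory as a mapping from each word to its list of entries (line offsets or message numbers). The function name `entry_stats` is hypothetical, and the standard deviation is the population form.

```python
from statistics import mean, pstdev


def entry_stats(index):
    """Summarise how many entries each word has in the index.

    `index` maps word -> list of entries. Returns None for an empty index.
    """
    counts = [len(entries) for entries in index.values()]
    if not counts:
        return None
    return {
        "words": len(counts),       # number of distinct words
        "min": min(counts),         # fewest entries for any word
        "max": max(counts),         # most entries for any word
        "mean": mean(counts),       # average entries per word
        "std_dev": pstdev(counts),  # population standard deviation
    }
```

A large maximum next to a small mean would suggest that a few very common words account for most of the index size.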


-- 
Alex Tweedly       http://www.tweedly.net






