Indexing mail list messages

Alejandro Tejada capellan2000 at yahoo.com
Thu Jul 21 18:22:48 EDT 2005


Hi Developers,

I've started building indexes for searching
(from a CD-RW) for keywords and phrases within the
200 MB of mail list messages.

Many of you suggested third-party software,
but I'm sure that RR is able to search for
phrases and keywords within these text files.

The files (for RR mail list messages) range from
543 KB to 4.8 MB, and my first idea is to create
two indexes for each of the 45 mail message
text files.

The first index has a list of each
message subject submitted in that month,
followed by the line or lines where this subject
is found in the text. For example:

message subject         lines where this text appears

Subject: Gif animation  75,124,257,310,358,

Creating this index took only a few minutes for
all the files.
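
A minimal sketch of the first index, written in Python rather than RR and
only to illustrate the idea. It assumes that every subject line in the
archive starts with the literal text "Subject: ". The function names are
hypothetical and not part of any existing tool.

```python
from collections import defaultdict


def build_subject_index(lines):
    """Map each 'Subject: ...' line to the 1-based line numbers where it appears."""
    index = defaultdict(list)
    for line_number, line in enumerate(lines, start=1):
        # Assumption: a subject header always begins with "Subject: ".
        if line.startswith("Subject: "):
            index[line.rstrip()].append(line_number)
    return dict(index)


def format_subject_index(index):
    """Render one row per subject: the subject text, a tab, then the line numbers."""
    rows = []
    for subject, line_numbers in index.items():
        rows.append(subject + "\t" + ",".join(str(n) for n in line_numbers))
    return "\n".join(rows)


# Tiny stand-in for one monthly archive file.
sample = [
    "From: someone",
    "Subject: Gif animation",
    "body text",
    "Subject: Gif animation",
]
subject_index = build_subject_index(sample)
print(format_subject_index(subject_index))
```

Because only header lines are indexed, this index stays small compared
with the archive, which matches the short build time reported above.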

The second index is for keywords within each 
text file, using the same approach.
Unfortunately, pairing words with line offsets
this way created, in some cases, index
files bigger than the mail archive! :-(
For example, the June 2005 text file is only
4.8 MB, but the index is more than 5.3 MB...
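
The word index described above can be sketched as follows, again in
Python and only as an illustration. The tokenizing pattern, the
tab-and-comma storage format and the function names are assumptions, not
the actual RR script. The sketch also measures the size of the stored
index, which hints at why it grows: every word on every line adds a
multi-digit line number plus a separator, often about as many bytes as
the word itself.

```python
import re
from collections import defaultdict

# Assumption: a "word" is a run of letters, digits or apostrophes.
WORD_PATTERN = re.compile(r"[A-Za-z0-9']+")


def build_word_index(lines):
    """Map each lowercased word to the 1-based line numbers that contain it."""
    index = defaultdict(list)
    for line_number, line in enumerate(lines, start=1):
        # set() records a line only once per word, even if the word repeats.
        for word in set(WORD_PATTERN.findall(line.lower())):
            index[word].append(line_number)
    return dict(index)


def serialized_size(index):
    """Size in bytes of the index stored as 'word<TAB>n1,n2,...' rows."""
    rows = (
        word + "\t" + ",".join(map(str, numbers))
        for word, numbers in sorted(index.items())
    )
    return len("\n".join(rows).encode("utf-8"))


sample_lines = ["the gif animation", "the animation stops", "the end"]
word_index = build_word_index(sample_lines)
print(word_index["animation"], serialized_size(word_index))
```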

After I deleted the stop words from the index
(search in Google for: "google stop words"),
it was "reduced" to 3.5 MB. Still too big for
my taste.
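
The stop word step can be sketched like this. The stop word set below is
a short illustrative sample, not the list found through Google, and the
function name is hypothetical.

```python
# Assumption: a tiny sample list; a real stop word list is much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for"}


def remove_stop_words(index, stop_words=STOP_WORDS):
    """Return a copy of the word index without the entries for stop words."""
    return {
        word: numbers
        for word, numbers in index.items()
        if word not in stop_words
    }


example_index = {"the": [1, 2, 3], "gif": [1], "animation": [1, 2]}
filtered = remove_stop_words(example_index)
print(filtered)
```

Stop words appear on most lines, so their entries carry the longest lists
of line numbers. That would explain why removing them cuts the size
noticeably without making the index small.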

Which approach could I take to build a smaller
yet accurate word index for mail list archives?

Thanks in advance.

al 




Visit my site:
http://www.geocities.com/capellan2000/


		
 


