Indexing mail list messages
Alejandro Tejada
capellan2000 at yahoo.com
Thu Jul 21 18:22:48 EDT 2005
Hi Developers,
I've started building indexes for searching
(from a CD-RW) for keywords and phrases within the
200 MB of mail list messages.
Many of you suggested third-party software,
but I'm sure that RR is able to search for
phrases and keywords within these text files.
The files (for the RR mail list messages) range from
543 KB to 4.8 MB, and my first idea is to create
two indexes for each of the 45 mail message
text files.
The first index has a list of each
message subject submitted in that month,
followed by the line or lines where this subject
is found in the text. For example:
message subject           lines where this text appears
Subject: Gif animation    75,124,257,310,358,
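
The subject index described above can be sketched as follows. This is only a minimal illustration in Python, not the RR script that produced the index; the function names (build_subject_index, format_index) and the sample lines are made up for the example, and it assumes that every subject line starts with "Subject:" and that "Re:" prefixes are not normalized.

```python
from collections import defaultdict

def build_subject_index(lines):
    """Map each subject to the 1-based line numbers where it appears."""
    index = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        if line.startswith("Subject:"):
            subject = line[len("Subject:"):].strip()
            index[subject].append(number)
    return dict(index)

def format_index(index):
    """Write one 'Subject: <text><TAB><line numbers>' row per subject."""
    return "\n".join(
        f"Subject: {subject}\t{','.join(map(str, numbers))}"
        for subject, numbers in sorted(index.items())
    )

# Hypothetical sample standing in for one month's archive file.
sample = [
    "From: someone",
    "Subject: Gif animation",
    "body text",
    "Subject: Gif animation",
    "Subject: Other topic",
]
subject_index = build_subject_index(sample)
print(format_index(subject_index))
```

Because only the subject lines are indexed, this file stays small compared with the archive itself.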
Creating this index took only a few minutes for
all the files.
The second index is for keywords within each
text file, using the same approach.
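
A rough Python sketch of such a keyword index, under my own assumptions: words are lowercased, a "word" is any run of letters, digits or apostrophes, and each word is recorded at most once per line. The names WORD and build_word_index are invented for this example.

```python
import re
from collections import defaultdict

# Assumed definition of a word: letters, digits and apostrophes.
WORD = re.compile(r"[a-z0-9']+")

def build_word_index(lines):
    """Map each lowercased word to the 1-based lines that contain it."""
    index = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        # set() keeps a word from being listed twice for the same line.
        for word in set(WORD.findall(line.lower())):
            index[word].append(number)
    return dict(index)

# Hypothetical two-line sample.
sample = ["The gif animation", "plays the animation twice"]
word_index = build_word_index(sample)
print(word_index["animation"])
```

Every distinct word on every line adds an entry, which is why an index like this grows almost as fast as the text it describes.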
Unfortunately, pairing words with line offsets
in this way created, in some cases,
files bigger than the mail archive! :-(
For example, the June 2005 text file is only
4.8 MB, but the index is more than 5.3 MB...
After I deleted the stop words from the index
(search in Google for: "google stop words"),
it was "reduced" to 3.5 MB. Still too big for
my taste.
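
The stop-word step might look like the sketch below. The stop-word list here is a tiny hypothetical one, not the list found through Google, and serialized_size assumes the index is written as one "word<TAB>line numbers" row per word, which may differ from the real file layout.

```python
# Hypothetical, very short stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "for"}

def drop_stop_words(index, stop_words=STOP_WORDS):
    """Return a copy of the index without entries for stop words."""
    return {word: numbers for word, numbers in index.items()
            if word not in stop_words}

def serialized_size(index):
    """Size in bytes of the index written one word per row."""
    text = "\n".join(
        f"{word}\t{','.join(map(str, numbers))}"
        for word, numbers in sorted(index.items())
    )
    return len(text.encode("utf-8"))

# Hypothetical index: the stop word carries the longest line list.
full_index = {"the": [1, 2, 3, 4], "gif": [1], "animation": [1, 4]}
small_index = drop_stop_words(full_index)
print(serialized_size(full_index), serialized_size(small_index))
```

Stop words occur on nearly every line, so their rows hold the longest lists; removing them helps, but every remaining word still carries its own list of line numbers.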
Which approach could I take to build a smaller
yet still accurate word index for mail list archives?
Thanks in advance.
al
Visit my site:
http://www.geocities.com/capellan2000/