Indexing mail list messages

Alejandro Tejada capellan2000 at yahoo.com
Thu Jul 21 21:36:06 EDT 2005


Alex Tweedley wrote:

> Are you indexing every line where the word exists ?

Oh yes, in my first try I was guilty of that... :-(

> Could you instead index only the message number (or 
> id, or first line of the message) ?

Ah! The msg id... this is a good choice because
this specific line is not repeated when
developers reply to a message.
So there is only one msg id for every msg. :-)

> Or could you post the code / a stack to save me 
> asking you another 50 questions ... ? :-)

Here is the first iteration, which produced
an index larger than the indexed file.
(I'll change it to work with message numbers
instead of line numbers.)

-- start script --
on mouseUp
  -- based on Scott Raney's example code
  -- comments that start with "#" are his...

  answer file "Select a mail message text file for input:"
  if it is empty then exit mouseUp
  # let user know we're working on it
  set the cursor to watch
  put it into inputFile
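  -- Revolution file paths use "/" on every platform, so splitting
  -- on "\" never changed anything: keep the full path and swap the
  -- 4-character extension (e.g. ".txt") for ".wndx"
  put inputFile into zvn
  put ".wndx" into char -4 to -1 of zvn
  put "file:" before zvn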
  
  open file inputFile for read
  read from file inputFile until eof
  put it into fileContent
  close file inputFile
  
  repeat for each line w in fileContent
    add 1 to mylinecount
    repeat for each word z in w
      put mylinecount & comma after wordCount[z]
    end repeat
  end repeat
  
  # copy all the indexes that are in the wordCount associative array
  put keys(wordCount) into keyWords
  # sort the indexes -- keyWords contains a list of elements in array
  sort keyWords
  repeat for each line l in keyWords
    put l & tab & wordCount[l] & return after displayResult
  end repeat
  put displayResult into URL zvn

  -- look for a file with the extension *.wndx
  -- in the same location as the selected text file
  -- This *.wndx file contains the index.

end mouseUp

-- end script --
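
The change to message numbers could look roughly like the sketch below.
It is only a sketch, not tested against the real archive: the function
name buildMessageIndex and its variable names are made up for
illustration, and it assumes that every message in the file starts with
a line beginning with "From " (the usual mbox separator). Each word
stores a message number only once, even if the word appears many times
in that message.

```livecode
function buildMessageIndex pMailText
  local tMsgNum, tIndex, tLastMsg, tKeys, tResult
  put 0 into tMsgNum
  repeat for each line tLine in pMailText
    -- assumption: a line starting with "From " opens a new message
    if char 1 to 5 of tLine is "From " then add 1 to tMsgNum
    -- ignore any text that comes before the first message
    if tMsgNum is 0 then next repeat
    repeat for each word tWord in tLine
      -- record each message number only once per word
      if tLastMsg[tWord] is not tMsgNum then
        put tMsgNum & comma after tIndex[tWord]
        put tMsgNum into tLastMsg[tWord]
      end if
    end repeat
  end repeat
  -- one output line per word: word, tab, list of message numbers
  put the keys of tIndex into tKeys
  sort lines of tKeys
  repeat for each line tKey in tKeys
    put tKey & tab & tIndex[tKey] & return after tResult
  end repeat
  return tResult
end buildMessageIndex
```

In the mouseUp handler, something like
"put buildMessageIndex(fileContent) into URL zvn" would replace the two
repeat loops. Because a word now adds at most one entry per message
instead of one per line, the index should come out smaller.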

> Are you keeping the whole mbox format ?  

Yes, completely.

> Or discarding the headers you don't need ?

No, the file is complete without change.

> How many different words remain after the stop words
> are discarded ?

Not too many words, but there are a lot
of similar words that change a little
in their endings.

> How many lines in the file ?   
> How many entries per word ? 
> (min, max, avg, mean, std dev) .. ?

With the code above, and this file:
<http://mail.runrev.com/pipermail/use-revolution/2005-June.txt.gz>
the answers to these questions can be seen at a glance. ;-)
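
For actual numbers rather than a glance, a small function along these
lines could read the generated index text and report the entries per
word. This is an untested sketch: indexStats is a hypothetical name, it
assumes the "word, tab, comma-separated list" layout written by the
script above, and it only covers min, max and mean (std dev is left
out).

```livecode
function indexStats pIndexText
  local tTabPos, tEntries, tCount, tMin, tMax, tTotal, tWords
  put 0 into tTotal
  put 0 into tWords
  repeat for each line tLine in pIndexText
    -- each index line is: word, tab, comma-separated entries
    put offset(tab, tLine) into tTabPos
    if tTabPos is 0 then next repeat
    put char (tTabPos + 1) to -1 of tLine into tEntries
    -- the trailing comma does not count as an extra item
    put the number of items of tEntries into tCount
    if tWords is 0 then
      put tCount into tMin
      put tCount into tMax
    else
      put min(tMin, tCount) into tMin
      put max(tMax, tCount) into tMax
    end if
    add tCount to tTotal
    add 1 to tWords
  end repeat
  if tWords is 0 then return empty
  -- result: min, max, mean entries per word
  return tMin & comma & tMax & comma & (tTotal / tWords)
end indexStats
```

Calling it as "put indexStats(displayResult)" at the end of the handler
would show the three values in the message box.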

I'll keep building on these new ideas!
Thanks a lot for your help! 

al

Visit my site:
http://www.geocities.com/capellan2000/
