[ANN] Stemmer library for six European languages

Eric Chatonet eric.chatonet at sosmartsoftware.com
Sun Nov 19 10:38:43 EST 2006


Hi all,

It has been a long time since I last uploaded a stack to RevOnline...
This new one is very specific:

Stemm Lib

Title: Stemmer Library
Category: Utilities
Description:
English, French, Italian, Spanish, German and Portuguese stemmers.
English stemmer originally written by Ken Ray, others by Eric Chatonet.

Porter algorithms are very handy for automatically isolating the stem of
a word (that is, the main part of a word to which affixes are added).
However, they are known not to be 100% reliable. To address this
issue, I adopted the following approach:
Words are first checked against a list of words known to be stemmed
incorrectly (that is, words the algorithm got wrong when it was applied
to a corpus of 20 000 forms). If a match is found in this list, the
stem is obtained by simple dictionary lookup. If no match is found,
the stem is computed using Porter's algorithm.
With this approach, reliability was found to be higher than 99% for
each of the six languages :-)
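
For readers who prefer code, the lookup-then-fallback idea can be sketched roughly as follows. This is Python rather than the stack's own script, and it is only an illustration: the names EXCEPTIONS, algorithmic_stem and stem are hypothetical, the two exception entries are made up, and the suffix stripper is a toy stand-in, not Porter's algorithm and not the code in the library.

```python
# Hypothetical exception list: words the algorithmic stemmer is assumed to
# get wrong, mapped to the stem we want instead (illustrative entries only).
EXCEPTIONS = {
    "news": "news",
    "went": "go",
}


def algorithmic_stem(word):
    """Toy stand-in for a Porter-style stemmer: strips one common suffix."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem(word):
    """Check the exception list first, then fall back to the algorithm."""
    key = word.lower()
    if key in EXCEPTIONS:
        return EXCEPTIONS[key]
    return algorithmic_stem(key)
```

In this sketch the toy stemmer alone would turn "news" into "new"; the exception list catches that case before the algorithm runs, while a regular form such as "walking" still goes through the algorithm.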

Explanations on how to use this stack are in the lib itself.

Thanks to Marielle for having edited this description :-)

To find it on RevOnline, look for the username: sosmartsoftware

Best Regards from Paris,
Eric Chatonet
------------------------------------------------------------------------
http://www.sosmartsoftware.com/    eric.chatonet at sosmartsoftware.com/




