Semi-automatic Index generation?

Thu Jul 31 13:44:29 EDT 2008

On Jul 31, 2008, at 2:12 AM, viktoras didziulis wrote:

> Hi David,
>
> you might wish to discard the 1000 most frequently used words from
> your list:
> English: http://web1.d25.k12.id.us/home/curriculum/fuw.pdf
> German: http://german.about.com/library/blwfreq01.htm
>
> Another approach is statistical: take the whole text and sort the words
> by their frequency (count) of appearance in the text. If you put them
> on a graph you would notice a characteristic 'power law' distribution.
> Set the absolute or relative frequency or count value at which to cut
> the tail. This tail is what holds all the rare or interesting words of
> the text. For example, if the text is large you may discard the first
> 500-1000 words in the list sorted by descending word count. All words
> that remain should be the ones that are more or less interesting.
>
> An easy way to produce such a frequency list is by using arrays. The
> principle is like this:
>
> local arrayWords
> -- count how often each word occurs in theText
> repeat for each word myWord in theText
>    add 1 to arrayWords[myWord]
> end repeat
>
> Now the keys of arrayWords are the words and the values are the word
> counts.

Slick, and so simple. This is going into my script library. Thanks,
Viktoras!
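
For the archive, here is a minimal sketch of how the counting loop and
the cutoff step described above might fit together. The function name
rareWords and the parameter pSkip are hypothetical placeholders, not
anything from the original post. It assumes words are split the way the
"word" chunk splits them (on whitespace, with a double-quoted string
counted as one word), so punctuation stays attached unless the text is
cleaned up first.

```livecode
-- Hypothetical helper (name and parameters are assumptions):
-- returns "word <tab> count" lines, most frequent first,
-- with the pSkip most frequent words removed.
function rareWords pText, pSkip
   local tCounts, tList
   -- count occurrences; an empty array element is treated as 0
   repeat for each word tWord in pText
      add 1 to tCounts[tWord]
   end repeat
   -- flatten the array into one "word <tab> count" line per key
   repeat for each key tWord in tCounts
      put tWord & tab & tCounts[tWord] & return after tList
   end repeat
   delete last char of tList
   -- sort by count, most frequent first
   set the itemDelimiter to tab
   sort lines of tList descending numeric by item 2 of each
   -- drop the head of the list, keeping the tail of rarer words
   if pSkip > 0 then delete line 1 to pSkip of tList
   return tList
end rareWords
```

It could be called along the lines of
put rareWords(field "Text", 500) into field "Index"
where both field names are made up for illustration. A cutoff based on a
count threshold rather than a fixed number of lines would work just as
well; which one suits better probably depends on the size of the text.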

Regards,

Devin

Devin Asay
Humanities Technology and Research Support Center
Brigham Young University