Semi-automatic Index generation?
viktoras at ekoinf.net
Thu Jul 31 04:12:29 EDT 2008
You might wish to discard the 1000 most frequently used words from your list.
Another approach is statistical: take the whole text and sort the words
by their frequency (count) of occurrence in the text. Plotted on a
graph, the counts show a characteristic 'power law' distribution. Set
an absolute or relative frequency at which to cut off the head of the
list; the tail is what holds all the rare or interesting words of the
text. For example, if the text is large you might discard the first
500-1000 words in the count-sorted list. The words that remain should
be the more or less interesting ones.
An easy way to produce such a frequency list is with arrays. The
principle is like this:

  repeat for each word myWord in theText
    add 1 to arrayWords[myWord]
  end repeat

Afterwards the keys of arrayWords are the words and the values are
their counts.
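Since this list is about Revolution/LiveCode, the snippet above uses its
syntax; for comparison, the whole frequency-cut idea can be sketched in
Python (the function name and the discard_top parameter are just my own
labels for illustration, and the right cutoff depends on your text):

```python
from collections import Counter
import re

def interesting_words(text, discard_top=500):
    """Count word frequencies, drop the most common words (the head of
    the power-law distribution), and return the rarer tail."""
    # split into lowercase words
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # words ranked from most to least frequent
    ranked = [w for w, _ in counts.most_common()]
    # discard the head; the tail holds the rarer, domain-specific words
    return ranked[discard_top:]
```

For a large corpus you would call it with the default cutoff; for a
small sample, a much smaller discard_top is needed, since there are
fewer distinct words in total.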
David Bovill wrote:
> Is there a resource/index that anyone knows of for plain,
> uninteresting, dull words? I want to take arbitrary chunks of text and
> search for "interesting" words - that is, domain-specific words that
> might be useful as links to create dictionary entries. This would mean
> creating a list of words and stripping "the", "it", etc. I am imagining
> it working like a spelling dictionary with the ability to manually edit
> entries - but I'd like a good starting list. Not sure what to search
> for :)