[somewhat OT] Text processing question (sort of)

Robert Brenstein rjb at robelko.com
Mon May 19 17:34:37 EDT 2008


On 18/05/08 at 23:03 -0700 Kee Nethery apparently wrote:
>Interesting problem.
>
>if you are looking for typos, here are my thoughts.
>
>What are the probable errors? Seems to me you have:
>1. Typos in individual words
>2. Extra spaces in individual words (so that you end up with two 
>words instead of one)
>3. Punctuation differences
>4. Perhaps words such as "the", "and", "an" missing from titles.
>
>...
>So long story short, slice and dice the quotes to collect a set of 
>pairs that appear to be similar. Then build a flashcard kind of 
>interface in RunRev that allows you the human to read the two 
>similar quotes and decide whether to delete one or not.
>
>I'd combine brute force with human visuals. 40000 lines seems like a 
>small data set for brute force.
>
>Kee Nethery

Finding identical lines is fairly trivial. Using fuzzy search to find 
similar lines is definitely more complicated. However, there are 
well-known algorithms for detecting spelling errors. One of the common 
and rather simple approaches is to compute the so-called 
Damerau-Levenshtein distance between two strings. This is quite fast 
in Rev.

http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
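
To make the idea concrete, here is a minimal sketch of the distance 
computation. It is written in Python rather than Rev purely for 
illustration, and it implements the restricted variant (often called 
optimal string alignment), which counts insertions, deletions, 
substitutions and swaps of adjacent characters. The function name 
osa_distance is my own choice, not something from a library.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Counts the insertions, deletions, substitutions and adjacent
    transpositions needed to turn a into b. Illustrative sketch only.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i chars of a and first j chars of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # adjacent transposition, e.g. "teh" -> "the"
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


if __name__ == "__main__":
    print(osa_distance("teh", "the"))
    print(osa_distance("kitten", "sitting"))
```

A swapped pair of letters counts as one edit here, where plain 
Levenshtein would count two, which is why this family of distances 
suits typo detection. The cost is proportional to the product of the 
two string lengths, so it is best applied only to pairs already 
suspected to be similar.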

The approach I'd take:

0. find and eliminate identical items
1. clean the word spacing
2. find and eliminate identical items
3. compare and clean punctuation. This may partly require human 
inspection, but the program can report those cases.
4. again eliminate identical items
5. use a simplified approach, like what Kee suggests or computing a 
word factor as you suggested, to identify line pairs suspected to 
differ by spelling and other minor alterations.
6. compute the Damerau-Levenshtein distance for those pairs and report 
cases for human inspection.
7. correct typos and standardize texts as needed.
8. find and eliminate identical items.
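
The early steps can be sketched roughly as follows, again in Python 
for illustration only. All function names are hypothetical. The word 
overlap measure is just one possible stand-in for the "simplified 
approach" in step 5, and ignoring case there, like the 0.6 threshold, 
is my own assumption. The pairwise comparison is quadratic in the 
number of lines, so on a large list it would be slow without some 
grouping first.

```python
import string
from itertools import combinations


def clean_spacing(line):
    """Step 1: collapse runs of whitespace and trim the ends."""
    return " ".join(line.split())


def dedupe(lines):
    """Steps 0, 2, 4, 8: drop exact duplicates, keeping first occurrences."""
    return list(dict.fromkeys(lines))


def strip_punctuation(line):
    """Remove ASCII punctuation so lines can be compared without it."""
    table = str.maketrans("", "", string.punctuation)
    return clean_spacing(line.translate(table))


def punctuation_variants(lines):
    """Step 3: group lines that differ only in punctuation, for review."""
    groups = {}
    for line in lines:
        groups.setdefault(strip_punctuation(line), []).append(line)
    return [group for group in groups.values() if len(group) > 1]


def word_overlap(a, b):
    """Shared words divided by all distinct words (case ignored: assumption)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)


def candidate_pairs(lines, threshold=0.6):
    """Step 5: brute-force all pairs and keep those sharing many words."""
    return [(a, b) for a, b in combinations(lines, 2)
            if word_overlap(a, b) >= threshold]


if __name__ == "__main__":
    quotes = [
        "To be,  or not to be",
        "To be, or not to be",
        "To be or not to be",
        "To be, or not too be",
    ]
    cleaned = dedupe([clean_spacing(q) for q in quotes])
    print(punctuation_variants(cleaned))
    print(candidate_pairs(cleaned))
```

The pairs that come out of candidate_pairs are the ones worth passing 
to the distance computation in step 6 and then showing to a human.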

Robert


