[somewhat OT] Text processing question (sort of)
Robert Brenstein
rjb at robelko.com
Mon May 19 17:34:37 EDT 2008
On 18/05/08 at 23:03 -0700 Kee Nethery apparently wrote:
>Interesting problem.
>
>If you are looking for typos, here are my thoughts.
>
>What are the probable errors? Seems to me you have:
>1. Typos in individual words
>2. Extra spaces in individual words (so that you end up with two
>words instead of one)
>3. Punctuation differences
>4. Perhaps words such as "the", "and", "an" missing from titles.
>
>...
>So long story short, slice and dice the quotes to collect a set of
>pairs that appear to be similar. Then build a flashcard kind of
>interface in RunRev that allows you, the human, to read the two
>similar quotes and decide whether to delete one or not.
>
>I'd combine brute force with human visuals. 40000 lines seems like a
>small data set for brute force.
>
>Kee Nethery
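Finding identical lines is fairly trivial. Using fuzzy search to find
similar lines is definitely more complicated. However, there are
well-known algorithms for detecting spelling errors. One of the common
and rather simple approaches is to compute the so-called
Damerau-Levenshtein distance: the minimum number of single-character
insertions, deletions, substitutions, and swaps of two adjacent
characters needed to turn one string into the other. This is quite
fast in Rev.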
http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
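
For illustration only, here is a minimal sketch of that distance in
Python rather than Rev. It implements the restricted variant (often
called optimal string alignment), which counts a swap of two adjacent
characters as one edit but never edits the same substring twice. The
function name osa_distance is my own choice, not something from a
library.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = edits needed to turn a[:i] into b[:j]
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # swap of two adjacent characters, e.g. "teh" -> "the"
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]
```

A small distance relative to the line length (say 1 or 2) suggests a
typo rather than a genuinely different line; where to put that cutoff
is a judgment call for the data at hand.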
The approach I'd take:
0. find and eliminate identical items
1. clean the word spacing
2. find and eliminate identical items
3. compare and clean punctuation. This may partly require human
inspection, but the program can report those cases.
4. again eliminate identical items
5. use a simplified approach, like what Kee suggests or computing a
word factor as you suggested, to identify line pairs suspected to
differ by spelling and other minor alterations.
6. compute the Damerau-Levenshtein distance for those and report cases
for human inspection.
7. correct typos and standardize texts as needed.
8. find and eliminate identical items.
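
As a rough sketch of steps 0 through 5 (in Python, not Rev), something
along these lines might do. All function names here are hypothetical,
the "word factor" is assumed to be a simple shared-word ratio, and the
0.6 threshold is an arbitrary starting point, not a recommendation.
Note that string.punctuation covers ASCII punctuation only.

```python
import string

def clean_spacing(line: str) -> str:
    # Step 1: trim and collapse runs of whitespace to single spaces.
    return " ".join(line.split())

def dedupe(lines):
    # Steps 0, 2, 4, 8: drop exact duplicates, keeping first occurrences.
    seen, out = set(), []
    for line in lines:
        if line not in seen:
            seen.add(line)
            out.append(line)
    return out

_PUNCT = str.maketrans("", "", string.punctuation)

def punctuation_key(line: str) -> str:
    # The line with ASCII punctuation removed and spacing cleaned.
    return clean_spacing(line.translate(_PUNCT))

def punctuation_variants(lines):
    # Step 3: group lines that differ only in punctuation, for review.
    groups = {}
    for line in lines:
        groups.setdefault(punctuation_key(line), []).append(line)
    return [g for g in groups.values() if len(g) > 1]

def word_overlap(a: str, b: str) -> float:
    # Assumed "word factor": shared words divided by all distinct words.
    wa = set(punctuation_key(a).lower().split())
    wb = set(punctuation_key(b).lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)

def suspect_pairs(lines, threshold=0.6):
    # Step 5: brute-force all pairs; this is quadratic in the line count.
    pairs = []
    for i in range(len(lines)):
        for j in range(i + 1, len(lines)):
            if word_overlap(lines[i], lines[j]) >= threshold:
                pairs.append((lines[i], lines[j]))
    return pairs
```

The pairs returned by suspect_pairs would then go to the distance
computation in step 6 and on to human inspection. The all-pairs loop is
plain brute force, so for 40000 lines it may be worth comparing only
lines that share at least one uncommon word.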
Robert