[somewhat OT] Text processing question (sort of)

Kee Nethery kee at kagi.com
Mon May 19 02:03:02 EDT 2008


Interesting problem.

if you are looking for typos, here are my thoughts.

What are the probable errors? Seems to me you have:
1. Typos in individual words
2. Extra spaces in individual words (so that you end up with two words  
instead of one)
3. Punctuation differences
4. Perhaps words such as; "the", "and", "an" missing from titles.

I think I would generate a letter count for each quotation.

For your example:
"God bless America"    George W Bush    Houston, March 18 2005
"Godi bless America"    George W Bush    Huston, March 18 2005

The quotation letter counts are
2 1 1 0 2 0 1 0 1 0 0 1 1 0 1 0 0 1 2 0 0 0 0 0 0 0 for "God bless  
America" (2 a's, 1 b, 1 c ...)
and
2 1 1 0 2 0 1 0 2 0 0 1 1 0 1 0 0 1 2 0 0 0 0 0 0 0 for "Godi bless  
America"

Then if you sort by these number sets and compare to see how similar  
each count is, you;ll get potential matches that you should eyeball.  
For example, These two strings have all but one count exactly the  
same. I'd go through this process multiple times by rotating the first  
count to the rear and re-sorting.

1 1 0 2 0 1 0 1 0 0 1 1 0 1 0 0 1 2 0 0 0 0 0 0 0 2
1 1 0 2 0 1 0 2 0 0 1 1 0 1 0 0 1 2 0 0 0 0 0 0 0 2

and just keep doing that until every letter has had a chance to be the  
first in the sort.

Basically The thing I'd have it do is find pairs of quotes that appear  
to be very similar. Then once you have a huge list of potential pairs,  
have something that displays them to you in pairs so that you can  
quickly tell the interface to delete one or to skip it.

I really do think you are going to want to make no changes to the data  
unless you look at the matches with your eyeballs. You could very  
easily end up with two completely different quotes that are extremely  
similar, just because person B was playing with a famous quote from  
person A.

So long story short, slice and dice the quotes to collect a set of  
pairs that appear to be similar. Then build a flashcard kind of  
interface in RunRev that allows you the human to read the two similar  
quotes and decide whether to delete one or not.

I'd combine brute force with human visuals. 40000 lines seems like a  
small data set for brute force.

Kee Nethery



More information about the use-livecode mailing list