[somewhat OT] Text processing question (sort of)
Kee Nethery
kee at kagi.com
Mon May 19 02:03:02 EDT 2008
Interesting problem.
if you are looking for typos, here are my thoughts.
What are the probable errors? Seems to me you have:
1. Typos in individual words
2. Extra spaces in individual words (so that you end up with two words
instead of one)
3. Punctuation differences
4. Perhaps words such as; "the", "and", "an" missing from titles.
I think I would generate a letter count for each quotation.
For your example:
"God bless America" George W Bush Houston, March 18 2005
"Godi bless America" George W Bush Huston, March 18 2005
The quotation letter counts are
2 1 1 0 2 0 1 0 1 0 0 1 1 0 1 0 0 1 2 0 0 0 0 0 0 0 for "God bless
America" (2 a's, 1 b, 1 c ...)
and
2 1 1 0 2 0 1 0 2 0 0 1 1 0 1 0 0 1 2 0 0 0 0 0 0 0 for "Godi bless
America"
Then if you sort by these number sets and compare to see how similar
each count is, you;ll get potential matches that you should eyeball.
For example, These two strings have all but one count exactly the
same. I'd go through this process multiple times by rotating the first
count to the rear and re-sorting.
1 1 0 2 0 1 0 1 0 0 1 1 0 1 0 0 1 2 0 0 0 0 0 0 0 2
1 1 0 2 0 1 0 2 0 0 1 1 0 1 0 0 1 2 0 0 0 0 0 0 0 2
and just keep doing that until every letter has had a chance to be the
first in the sort.
Basically The thing I'd have it do is find pairs of quotes that appear
to be very similar. Then once you have a huge list of potential pairs,
have something that displays them to you in pairs so that you can
quickly tell the interface to delete one or to skip it.
I really do think you are going to want to make no changes to the data
unless you look at the matches with your eyeballs. You could very
easily end up with two completely different quotes that are extremely
similar, just because person B was playing with a famous quote from
person A.
So long story short, slice and dice the quotes to collect a set of
pairs that appear to be similar. Then build a flashcard kind of
interface in RunRev that allows you the human to read the two similar
quotes and decide whether to delete one or not.
I'd combine brute force with human visuals. 40000 lines seems like a
small data set for brute force.
Kee Nethery
More information about the use-livecode
mailing list