[somewhat OT] Text processing question (sort of)

jbv jbv.silences at club-internet.fr
Sun May 18 12:41:46 EDT 2008


Hi list,

I've been asked to do some "cleaning" of a client's data, and am trying
to figure out a simple and fast algorithm to do the job in Rev, but
haven't had much success so far...

Here's the problem: the data consists of a collection of quotations by
various writers, politicians, etc. The data is organized in lines of 3
items:
the quote, the author, the place & date
The cleaning job consists of finding duplicates caused by typos.

Here's an (imaginary) example:
"God bless America"    George W Bush    Houston, March 18 2005
"Godi bless America"    George W Bush    Huston, March 18 2005

Typos can occur in any of the 3 items, and sometimes even in 2 or 3
items of the same line...
Last but not least, the data consists of about 40000 lines...

The first idea that comes to mind is a kind of brute-force approach:
compare each line, item by item, with each of the other lines, compute
a ratio of identical words, and keep only lines where the ratio found
for each item is above a certain threshold (say 80%)... The problem with
such a huge data set is that it might take forever...
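
A minimal sketch of the brute-force comparison described above, written
in Python rather than Rev only to pin down the logic. The function names
(word_ratio, find_duplicates), the tab delimiter between items, and the
exact definition of the ratio (shared words divided by the word count of
the longer item) are assumptions made for illustration, not part of the
client's data format.

```python
from collections import Counter
from itertools import combinations


def word_ratio(a, b):
    """Share of words two strings have in common (0.0 to 1.0)."""
    wa, wb = a.lower().split(), b.lower().split()
    if not wa and not wb:
        return 1.0
    # Multiset intersection: each word counts as often as it appears in both.
    shared = sum((Counter(wa) & Counter(wb)).values())
    return shared / max(len(wa), len(wb))


def find_duplicates(lines, threshold=0.8, delimiter="\t"):
    """Compare every pair of lines item by item.

    Returns the index pairs (i, j) for which every item reaches the
    threshold. The delimiter is an assumption about the file layout.
    """
    records = [line.split(delimiter) for line in lines]
    pairs = []
    for (i, r1), (j, r2) in combinations(enumerate(records), 2):
        if len(r1) == len(r2) and all(
            word_ratio(x, y) >= threshold for x, y in zip(r1, r2)
        ):
            pairs.append((i, j))
    return pairs
```

With 40000 lines this loop visits 40000 * 39999 / 2 = 799,980,000 pairs,
which is where the "forever" comes from. Note also that on short items a
single typo costs a lot: "God bless America" against "Godi bless America"
shares only 2 words out of 3, so under this definition the pair would
fall below an 80% threshold.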

I've also tried sorting the lines and comparing each line with the
previous one only, but if the typo occurs in the first char of any item
of a line, duplicates might end up far away from each other after the
sort... so it won't work...
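
The weakness of the sort-and-compare-neighbours idea can be shown with a
small sketch, again in Python rather than Rev. The helper name
adjacent_pairs and the sample strings (including the made-up typo
"Jouston") are hypothetical and only serve the illustration.

```python
def adjacent_pairs(lines):
    """Sort the lines and pair each one with its successor in sort order."""
    ordered = sorted(lines)
    return list(zip(ordered, ordered[1:]))


sample = [
    "Houston, March 18 2005",
    "Huston, March 18 2005",    # typo after the first char
    "Boston, May 2 2001",
    "Jouston, March 18 2005",   # typo in the first char
    "Iowa City, July 4 2003",
]
pairs = adjacent_pairs(sample)

# A typo later in the string can still leave the duplicates next to each other.
print(("Houston, March 18 2005", "Huston, March 18 2005") in pairs)   # True
# A typo in the first char moves the duplicate elsewhere in the sort order.
print(("Houston, March 18 2005", "Jouston, March 18 2005") in pairs)  # False
```

In a 40000-line file the gap between two such lines could be thousands
of positions, so a neighbour-only comparison would miss them.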

Any ideas?

thanks in advance,
JB



