[somewhat OT] Text processing question (sort of)

Terry Judd tsj at unimelb.edu.au
Sun May 18 17:52:48 EDT 2008


On 19/5/08 2:41 AM, "jbv" <jbv.silences at club-internet.fr> wrote:

> Hi list,
> 
> I've been asked to do some "cleaning" in a client's data, and am trying
> to figure out some simple and fast algorithm to do the job in Rev, but
> haven't got much success so far...
> 
> Here's the problem : the data consists in a collection of quotations by
> various writers, politicians, etc. The data is organized in lines of 3
> items :
> the quote, the author, the place & date
> The cleaning job consists in finding duplicates caused by typos.
> 
> Here's an (imaginary) example :
> "God bless America"    George W Bush    Houston, March 18 2005
> "Godi bless America"    George W Bush    Huston, March 18 2005
> 
> Typos can occur in any of the 3 items, and sometimes even in 2 or 3
> items of the same line...
> Last but not least, the data consists in about 40000 lines...

How about using the compress function to compare pairs of lines? If the
two lines compress to similar lengths, and the two lines joined together
compress to more or less the same length as either line on its own, then
you've almost certainly got a 'match'. The reasoning is that a near-copy
adds very little new information, so the compressor can encode most of the
second line as a repeat of the first. I haven't done this with thousands
of records but I have done it with hundreds and it's relatively quick.
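
To make the idea concrete, here is a rough sketch in Python rather than
Rev, with zlib standing in for the compress function. The function names,
the 0.3 threshold and the third sample line are all made up for
illustration, and the scoring formula (usually called the normalized
compression distance) is just one way of putting a number on "more or less
the same", not necessarily the best one for this data.

```python
import zlib


def compressed_len(text: str) -> int:
    """Length in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8")))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance between two lines.

    Close to 0 when b adds little beyond a (probable duplicate),
    close to 1 when the two lines share almost nothing.
    """
    ca, cb = compressed_len(a), compressed_len(b)
    cab = compressed_len(a + "\n" + b)
    return (cab - min(ca, cb)) / max(ca, cb)


def likely_duplicate(a: str, b: str, threshold: float = 0.3) -> bool:
    """True when the score falls below the threshold.

    The 0.3 default is an arbitrary assumption; tune it on real data.
    """
    return ncd(a, b) < threshold


# The two lines from the question, plus an unrelated made-up line.
line1 = '"God bless America"\tGeorge W Bush\tHouston, March 18 2005'
line2 = '"Godi bless America"\tGeorge W Bush\tHuston, March 18 2005'
line3 = '"I have a dream"\tMartin Luther King\tWashington, August 28 1963'

print(ncd(line1, line2), likely_duplicate(line1, line2))
print(ncd(line1, line3), likely_duplicate(line1, line3))
```

The pair with typos should score well below the unrelated pair. Note that
comparing every pair in 40000 lines means roughly 800 million comparisons,
so for the full data set you would probably want to group the lines first
(by author, say) and only compare within each group.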

Terry...



