[somewhat OT] Text processing question (sort of)

yoy yoy at comcast.net
Sun May 18 15:11:37 EDT 2008


That works, but if you pass the data to Perl it'll be handled 10,000 times 
faster. Not that I can remember how I did it! :(

I have a great memory, it's just short.

Best,

Andy

----- Original Message ----- 
From: "jbv" <jbv.silences at club-internet.fr>
To: "How to use Revolution" <use-revolution at lists.runrev.com>
Sent: Sunday, May 18, 2008 2:27 PM
Subject: Re: [somewhat OT] Text processing question (sort of)


>
> If anyone is interested: while trying to find the fastest way to compare
> each line of a list with every other line, I found the following technique
> quite fast:
>
> -- myData contains the 40000 lines to check
> -- myData1 is a duplicate of myData
>
> put myData into myData1
>
> repeat for each line j in myData
>   -- drop the current line from the copy so each pair is compared only once
>   delete line 1 of myData1
>   repeat for each line i in myData1
>     -- the actual comparison of j with i goes here
>   end repeat
> end repeat
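>
> For illustration, here is one way the comparison could be filled in,
> using the word-ratio idea from the original question below (a sketch
> only; it assumes tab-delimited items, tests just the quote item against
> an 80% threshold, and collects hits in a hypothetical tPossibleDups
> variable):
>
> set the itemDelimiter to tab
> put myData into myData1
> put empty into tPossibleDups
> repeat for each line j in myData
>   -- drop the current line so each pair is compared only once
>   delete line 1 of myData1
>   put the number of words in item 1 of j into tWordCount
>   if tWordCount = 0 then next repeat
>   repeat for each line i in myData1
>     -- count how many words of j's quote also appear in i's quote
>     put 0 into tHits
>     repeat for each word w in item 1 of j
>       if w is among the words of item 1 of i then add 1 to tHits
>     end repeat
>     -- flag the pair as a possible duplicate above the threshold
>     if tHits / tWordCount >= 0.8 then
>       put j & cr & i & cr & cr after tPossibleDups
>     end if
>   end repeat
> end repeat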
>
>
>> Hi list,
>>
>> I've been asked to do some "cleaning" of a client's data, and am trying
>> to figure out a simple and fast algorithm to do the job in Rev, but
>> haven't had much success so far...
>>
>> Here's the problem: the data consists of a collection of quotations by
>> various writers, politicians, etc. The data is organized in lines of 3
>> items:
>> the quote, the author, the place & date
>> The cleaning job consists of finding duplicates caused by typos.
>>
>> Here's an (imaginary) example :
>> "God bless America"    George W Bush    Houston, March 18 2005
>> "Godi bless America"    George W Bush    Huston, March 18 2005
>>
>> Typos can occur in any of the 3 items, and sometimes even in 2 or 3
>> items of the same line...
>> Last but not least, the data consists of about 40,000 lines...
>>
>> The first idea that comes to mind is a kind of brute-force approach:
>> compare each line, item by item, with each of the other lines, compute
>> a ratio of identical words, and keep only lines where the ratio found
>> for each item is above a certain threshold (say 80%)... The problem
>> with such a huge set of data is that it might take forever...
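>>
>> A sketch of such a per-item ratio as a handler (hypothetical helper,
>> not from the original post; it measures the fraction of words of one
>> item that also appear in the other):
>>
>> function wordRatio pA, pB
>>   -- fraction of the words of pA that also occur in pB
>>   if pA is empty then return 0
>>   put 0 into tHits
>>   repeat for each word w in pA
>>     if w is among the words of pB then add 1 to tHits
>>   end repeat
>>   return tHits / the number of words in pA
>> end wordRatio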
>>
>> I've also tried to sort the lines and compare each line with the
>> previous one only, but if the typo occurs in the first char of any
>> item of a line, duplicates might end up far away from each other
>> after the sort... so it won't work...
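>>
>> In script form the neighbour test was roughly the following (a sketch,
>> not the original code); since it only ever compares adjacent lines,
>> duplicates that sort far apart are missed:
>>
>> sort lines of myData
>> put empty into tPrev
>> repeat for each line j in myData
>>   if tPrev is not empty then
>>     -- compare j with tPrev only; pairs that sort apart never meet
>>   end if
>>   put j into tPrev
>> end repeat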
>>
>> Any idea?
>>
>> thanks in advance,
>> JB
>>
>
> _______________________________________________
> use-revolution mailing list
> use-revolution at lists.runrev.com
> Please visit this url to subscribe, unsubscribe and manage your 
> subscription preferences:
> http://lists.runrev.com/mailman/listinfo/use-revolution 



