[somewhat OT] Text processing question (sort of)
jbv
jbv.silences at club-internet.fr
Sun May 18 14:27:55 EDT 2008
If anyone is interested: while trying to find the fastest way to compare
each line of a list with every other line, I found the following technique
quite fast:
-- myData contains the 40000 lines to check
-- myData1 is a duplicate of myData
put myData into myData1
repeat for each line j in myData
  -- drop line j itself, so that each pair is visited only once
  delete line 1 of myData1
  repeat for each line i in myData1
    -- compare line j with line i here
  end repeat
end repeat
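
For anyone who wants to try the same idea outside Rev, here is a minimal sketch in Python (not Rev script). It visits every unordered pair of lines exactly once, which is what deleting line 1 of myData1 before the inner loop achieves. The function name each_pair and the sample data are made up for illustration.

```python
def each_pair(lines):
    """Yield every unordered pair of lines exactly once."""
    for index, first in enumerate(lines):
        # Only look at the lines after the current one, mirroring
        # "delete line 1 of myData1" in the Rev loop.
        for second in lines[index + 1:]:
            yield first, second


# Hypothetical sample data, just to show the pairing order.
sample = ["a", "b", "c", "d"]
pairs = list(each_pair(sample))
print(len(pairs))  # n * (n - 1) / 2 pairs for n lines
```

For 40000 lines this still means 40000 * 39999 / 2 = 799,980,000 pairs, so the comparison inside the inner loop has to stay cheap.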
> Hi list,
>
> I've been asked to do some "cleaning" in a client's data, and am trying
> to figure out a simple and fast algorithm to do the job in Rev, but
> haven't had much success so far...
>
> Here's the problem: the data consists of a collection of quotations by
> various writers, politicians, etc. The data is organized in lines of 3
> items:
> the quote, the author, the place & date
> The cleaning job consists of finding duplicates caused by typos.
>
> Here's an (imaginary) example :
> "God bless America" George W Bush Houston, March 18 2005
> "Godi bless America" George W Bush Huston, March 18 2005
>
> Typos can occur in any of the 3 items, and sometimes even in 2 or 3
> items of the same line...
> Last but not least, the data consists of about 40000 lines...
>
> The first idea that comes to mind is a kind of brute force approach:
> compare each line, item by item, with each of the other lines, compute
> a ratio of identical words, and keep only lines where the ratio found
> for each item is above a certain threshold (say 80%)... The problem with
> such a huge set of data is that it might take forever...
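
The ratio test described in the quoted paragraph could be sketched like this, in Python rather than Rev. The message does not say exactly how the "ratio of identical words" is computed, so this sketch assumes words are compared position by position and the count is divided by the longer word count. The tab separator between items and the names word_ratio and probably_duplicates are assumptions too.

```python
def word_ratio(a, b):
    """Fraction of word positions at which a and b hold the same word."""
    words_a, words_b = a.split(), b.split()
    longest = max(len(words_a), len(words_b))
    if longest == 0:
        return 1.0  # two empty items count as identical
    same = sum(1 for x, y in zip(words_a, words_b) if x == y)
    return same / longest


def probably_duplicates(line_a, line_b, threshold=0.8, sep="\t"):
    """True when every item of the two lines reaches the threshold."""
    items_a, items_b = line_a.split(sep), line_b.split(sep)
    if len(items_a) != len(items_b):
        return False
    return all(word_ratio(x, y) >= threshold
               for x, y in zip(items_a, items_b))


# The imaginary example from the message, assuming tab-separated items.
line_1 = '"God bless America"\tGeorge W Bush\tHouston, March 18 2005'
line_2 = '"Godi bless America"\tGeorge W Bush\tHuston, March 18 2005'
print(probably_duplicates(line_1, line_2, threshold=0.6))
```

With a word-level ratio, one typo in a three-word quote already brings that item down to 2/3, so a threshold of 80% may be too strict for short items and would need tuning against the real data.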
>
> I've also tried to sort lines and compare each line with the previous
> one only, but if the typo occurs in the first char of any item of a line,
> duplicates might be far away from each other after the sort... so it
> won't work...
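
A tiny illustration of that sorting problem, again in Python and with made-up sample quotes: a typo in the very first character sends the duplicate to the other end of the sorted list, so comparing each line only with its neighbour misses it.

```python
# Hypothetical sample: "Xod bless America" is a typo of "God bless America".
quotes = ["God bless America", "Xod bless America",
          "Hello world", "Peace on earth"]

ordered = sorted(quotes)

# Distance between the two near-duplicates after sorting.
gap = abs(ordered.index("God bless America")
          - ordered.index("Xod bless America"))
print(ordered)
print(gap)
```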
>
> Any ideas?
>
> thanks in advance,
> JB