[somewhat OT] Text processing question (sort of)
jbv
jbv.silences at club-internet.fr
Mon May 19 10:00:34 EDT 2008
Hi,
I finally implemented the option quoted below.
Since the data to process is in French, I had to include more
characters (é, è, ê, etc.) in the letter counts.
The data is organized as lines of 3 items each, so I ended up with
40000 lines of 3 items each, where each item contains the letter
counts for a, b, c, d... separated by spaces.
Then I sorted the lines as follows:
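
For readers who want to try the encoding outside Revolution, here is a
minimal sketch in Python. The alphabet string, the function names and
the comma item delimiter are assumptions made for illustration; the
post does not list the exact characters that were counted.

```python
import unicodedata
from collections import Counter

# Hypothetical alphabet: a-z plus some accented French letters.
# The exact set used in the original script is not given.
ALPHABET = "abcdefghijklmnopqrstuvwxyzàâçéèêëîïôùûü"

def letter_counts(text, alphabet=ALPHABET):
    """Return the per-letter counts of text as a space-separated string."""
    # NFC normalization makes "e" + combining accent count as one "é".
    normalized = unicodedata.normalize("NFC", text).lower()
    tally = Counter(normalized)
    return " ".join(str(tally[ch]) for ch in alphabet)

def encode_line(line, alphabet=ALPHABET, delimiter=","):
    """Replace each item of a line with its letter counts."""
    return delimiter.join(
        letter_counts(item, alphabet) for item in line.split(delimiter)
    )
```

Characters outside the alphabet (spaces, digits, punctuation) are
simply ignored, which is what makes the counts tolerant of punctuation
differences.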
repeat with j=3 down to 1
  repeat with i=charsNumber down to 1
    sort lines of myLetterCounts ascending numeric by word i of item j of each
  end repeat
end repeat
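
The nested loop above sorts on the least significant key first and the
most significant key last, which only works if the sort is stable. A
sketch of the same idea in Python (whose sort is stable), with a
single-pass equivalent for comparison; the function names, the comma
delimiter and the default sizes are assumptions for illustration:

```python
def parse(line, delimiter=","):
    """Split a line into items, and each item into a list of integer counts."""
    return [[int(w) for w in item.split()] for item in line.split(delimiter)]

def multipass_sort(lines, n_items=3, n_chars=26, delimiter=","):
    """Sort once per count, from the last count of the last item
    back to the first count of the first item."""
    rows = [(parse(line, delimiter), line) for line in lines]
    for j in reversed(range(n_items)):
        for i in reversed(range(n_chars)):
            # Stable sort: rows that tie on this key keep the order
            # established by the previous, less significant passes.
            rows.sort(key=lambda row: row[0][j][i])
    return [line for _, line in rows]

def single_sort(lines, delimiter=","):
    """Same ordering in one pass, comparing the parsed counts as a whole."""
    return sorted(lines, key=lambda line: parse(line, delimiter))
```

Both functions return the lines in the same order; the single-pass
version only performs one sort instead of n_items * n_chars.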
Then I automated the duplicate detection as follows:
get line 1 of myLetterCounts -- "it" holds the previous line
delete line 1 of myLetterCounts
put "" into myResult
put 5 into nCount -- used to adjust the detection threshold
-- nST is not set in this excerpt; presumably it is the number of
-- letter counts per item (charsNumber above)
repeat for each line j in myLetterCounts
  put 0 into a
  put 0 into b
  put item 1 of j into c
  put item 2 of j into d
  -- sum the count differences between the previous and current line
  repeat with w=1 to nST
    add abs(word w of item 1 of it - word w of c) to a
    add abs(word w of item 2 of it - word w of d) to b
  end repeat
  if a >0 and b > 0 and a <= nCount and b <= nCount then
    put it &cr& j &cr after myResult
    get last item of it
    get lineoffset(tab & last item of it &cr,ListRef)
    if it>0 then
      put line it of ListRef & cr after myResult
    end if
    get lineoffset(tab & last item of j &cr,ListRef)
    if it>0 then
      put line it of ListRef & cr after myResult
    end if
  end if
  get j -- the current line becomes the previous line
end repeat
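
The comparison step can be sketched in Python as follows. It compares
each sorted line only with the line just before it, and keeps the pair
when the summed count differences of items 1 and 2 are both non-zero
and both within the threshold, mirroring the condition in the script.
The function names and the comma delimiter are assumptions, and the
lookup in ListRef is left out.

```python
def distance(counts_a, counts_b):
    """Sum of absolute differences between two space-separated count strings."""
    a = [int(w) for w in counts_a.split()]
    b = [int(w) for w in counts_b.split()]
    return sum(abs(x - y) for x, y in zip(a, b))

def near_duplicates(sorted_lines, threshold=5, delimiter=","):
    """Return pairs of adjacent lines whose first two items nearly match."""
    pairs = []
    for prev, cur in zip(sorted_lines, sorted_lines[1:]):
        p, c = prev.split(delimiter), cur.split(delimiter)
        a = distance(p[0], c[0])
        b = distance(p[1], c[1])
        # As in the script, pairs with identical counts in item 1 or
        # item 2 (a == 0 or b == 0) are not reported.
        if 0 < a <= threshold and 0 < b <= threshold:
            pairs.append((prev, cur))
    return pairs
```

Because only neighbouring lines are compared, a near-duplicate whose
difference falls on an early sort key can end up far away in the
sorted list and be missed, which is why the quoted message suggests
re-sorting with rotated counts.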
At first glance, I don't think many duplicates escaped the detection...
The 40000 lines were processed in about 7 minutes on an old Mac G3...
Thanks again for the tip,
JB
> Interesting problem.
>
> If you are looking for typos, here are my thoughts.
>
> What are the probable errors? Seems to me you have:
> 1. Typos in individual words
> 2. Extra spaces in individual words (so that you end up with two words
> instead of one)
> 3. Punctuation differences
> 4. Perhaps words such as "the", "and", "an" missing from titles.
>
> I think I would generate a letter count for each quotation.
>
> For your example:
> "God bless America" George W Bush Houston, March 18 2005
> "Godi bless America" George W Bush Huston, March 18 2005
>
> The quotation letter counts are
> 2 1 1 1 2 0 1 0 1 0 0 1 1 0 1 0 0 1 2 0 0 0 0 0 0 0 for "God bless
> America" (2 a's, 1 b, 1 c ...)
> and
> 2 1 1 1 2 0 1 0 2 0 0 1 1 0 1 0 0 1 2 0 0 0 0 0 0 0 for "Godi bless
> America"
>
> Then if you sort by these number sets and compare to see how similar
> each count is, you'll get potential matches that you should eyeball.
> For example, these two strings have all but one count exactly the
> same. I'd go through this process multiple times by rotating the first
> count to the rear and re-sorting.
>
> 1 1 1 2 0 1 0 1 0 0 1 1 0 1 0 0 1 2 0 0 0 0 0 0 0 2
> 1 1 1 2 0 1 0 2 0 0 1 1 0 1 0 0 1 2 0 0 0 0 0 0 0 2
>
> and just keep doing that until every letter has had a chance to be the
> first in the sort.
>
> Basically, the thing I'd have it do is find pairs of quotes that appear
> to be very similar. Then once you have a huge list of potential pairs,
> have something that displays them to you in pairs so that you can
> quickly tell the interface to delete one or to skip it.
>
> I really do think you are going to want to make no changes to the data
> unless you look at the matches with your eyeballs. You could very
> easily end up with two completely different quotes that are extremely
> similar, just because person B was playing with a famous quote from
> person A.
>
> So long story short, slice and dice the quotes to collect a set of
> pairs that appear to be similar. Then build a flashcard kind of
> interface in RunRev that allows you, the human, to read the two similar
> quotes and decide whether to delete one or not.
>
> I'd combine brute force with human visuals. 40000 lines seems like a
> small data set for brute force.
>
> Kee Nethery