[somewhat OT] Text processing question (sort of)

jbv jbv.silences at club-internet.fr
Mon May 19 10:00:34 EDT 2008


Hi,

Finally I implemented the option quoted below...
As the data to process is in French, I had to consider additional
characters (é, è, ê, etc.) for the letter counts.

As the data is organized as lines of 3 items each, I ended up with
40000 lines, each also made of 3 items, where each item holds the
letter counts for a, b, c, d... separated by spaces.
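
In case it helps anyone, the counting itself can be done along these
lines (just a sketch: the function name and the character list are only
examples, not the exact code I used):

 function letterCounts pText
   -- example character list including some accented letters
   put "abcdefghijklmnopqrstuvwxyzéèêàâçîôùû" into tChars
   -- note: toLower may not fold accented capitals on every platform
   put toLower(pText) into pText
   repeat for each char c in tChars
     put 0 into tCount[c]
   end repeat
   repeat for each char c in pText
     if c is among the chars of tChars then
       add 1 to tCount[c]
     end if
   end repeat
   -- build the space-separated counts in the same order as tChars
   put "" into tVector
   repeat for each char c in tChars
     put tCount[c] & space after tVector
   end repeat
   delete last char of tVector -- trailing space
   return tVector
 end letterCounts

Each line of myLetterCounts is then built by calling it once per item of
the original line, e.g. letterCounts(item 1 of tLine) & comma &
letterCounts(item 2 of tLine) & comma & letterCounts(item 3 of tLine).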

Then I sorted the lines as follows:

 -- nested stable sorts: the sort executed last (word 1 of item 1)
 -- ends up as the primary key, so lines with similar count
 -- vectors land next to each other
 repeat with j=3 down to 1
  repeat with i=charsNumber down to 1
   sort lines of myLetterCounts ascending numeric by word i of item j of each
  end repeat
 end repeat

Then I automated the duplicate detection as follows:

 -- the first line becomes the initial reference line, kept in "it"
 get line 1 of myLetterCounts
 delete line 1 of myLetterCounts

 put "" into myResult
 put 5 into nCount -- used to adjust the detection threshold

 repeat for each line j in myLetterCounts
  -- a and b accumulate the letter count differences between
  -- the current line j and the previous line (kept in "it")
  put 0 into a
  put 0 into b
  put item 1 of j into c
  put item 2 of j into d
  repeat with w=1 to charsNumber
   add abs((word w of item 1 of it) - (word w of c)) to a
   add abs((word w of item 2 of it) - (word w of d)) to b
  end repeat
  -- report pairs that differ, but only within the threshold
  if a > 0 and b > 0 and a <= nCount and b <= nCount then
   put it & cr & j & cr after myResult
   -- fetch the corresponding lines of the original list (ListRef)
   get last item of it
   get lineoffset(tab & (last item of it) & cr, ListRef)
   if it > 0 then
    put line it of ListRef & cr after myResult
   end if
   get lineoffset(tab & (last item of j) & cr, ListRef)
   if it > 0 then
    put line it of ListRef & cr after myResult
   end if
  end if
  -- the current line becomes the reference for the next pass
  get j
 end repeat

At first glance, I don't think many duplicates escaped the detection...

The 40000 lines were processed in about 7 minutes on an old Mac G3...
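
Next comes the eyeball check. Something like the flashcard-style review
Kee suggests below could be scripted along these lines (only a sketch,
and it assumes the candidate pairs have first been gathered into a
variable tPairs, one pair per line with the two original quotes
separated by a tab, which is not the exact layout of myResult above):

 function reviewPairs tPairs
   -- tPairs: one candidate pair per line, the two original
   -- quotes separated by a tab (hypothetical layout)
   set the itemDelimiter to tab
   put "" into tToDelete
   repeat for each line tPair in tPairs
     answer (item 1 of tPair) & cr & cr & (item 2 of tPair) \
           with "Keep both" or "Delete second"
     if it is "Delete second" then
       put (item 2 of tPair) & cr after tToDelete
     end if
   end repeat
   return tToDelete -- the lines to remove from the original list
 end reviewPairs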

Thanks again for the tip,
JB

> Interesting problem.
>
> if you are looking for typos, here are my thoughts.
>
> What are the probable errors? Seems to me you have:
> 1. Typos in individual words
> 2. Extra spaces in individual words (so that you end up with two words
> instead of one)
> 3. Punctuation differences
> 4. Perhaps words such as "the", "and", "an" missing from titles.
>
> I think I would generate a letter count for each quotation.
>
> For your example:
> "God bless America"    George W Bush    Houston, March 18 2005
> "Godi bless America"    George W Bush    Huston, March 18 2005
>
> The quotation letter counts are
> 2 1 1 1 2 0 1 0 1 0 0 1 1 0 1 0 0 1 2 0 0 0 0 0 0 0 for "God bless
> America" (2 a's, 1 b, 1 c ...)
> and
> 2 1 1 1 2 0 1 0 2 0 0 1 1 0 1 0 0 1 2 0 0 0 0 0 0 0 for "Godi bless
> America"
>
> Then if you sort by these number sets and compare to see how similar
> each count is, you'll get potential matches that you should eyeball.
> For example, these two strings have all but one count exactly the
> same. I'd go through this process multiple times by rotating the first
> count to the rear and re-sorting.
>
> 1 1 1 2 0 1 0 1 0 0 1 1 0 1 0 0 1 2 0 0 0 0 0 0 0 2
> 1 1 1 2 0 1 0 2 0 0 1 1 0 1 0 0 1 2 0 0 0 0 0 0 0 2
>
> and just keep doing that until every letter has had a chance to be the
> first in the sort.
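
(For what it's worth, that rotation pass could be scripted roughly like
this -- only a sketch, assuming for simplicity that each line of tCounts
holds a single space-separated count vector:)

 repeat with i = 1 to charsNumber
  -- move the first count of every vector to the rear
  put "" into tRotated
  repeat for each line tLine in tCounts
   put (word 2 to -1 of tLine) && (word 1 of tLine) & cr after tRotated
  end repeat
  put tRotated into tCounts
  -- re-sort with the same stable nested sort as above
  repeat with w=charsNumber down to 1
   sort lines of tCounts ascending numeric by word w of each
  end repeat
  -- ...then scan adjacent lines for near matches, as in the script above...
 end repeat
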
>
> Basically, the thing I'd have it do is find pairs of quotes that appear
> to be very similar. Then once you have a huge list of potential pairs,
> have something that displays them to you in pairs so that you can
> quickly tell the interface to delete one or to skip it.
>
> I really do think you are going to want to make no changes to the data
> unless you look at the matches with your eyeballs. You could very
> easily end up with two completely different quotes that are extremely
> similar, just because person B was playing with a famous quote from
> person A.
>
> So long story short, slice and dice the quotes to collect a set of
> pairs that appear to be similar. Then build a flashcard kind of
> interface in RunRev that allows you, the human, to read the two similar
> quotes and decide whether to delete one or not.
>
> I'd combine brute force with human visuals. 40000 lines seems like a
> small data set for brute force.
>
> Kee Nethery