Comparing Big Lists

Gregory Lypny gregory.lypny at videotron.ca
Mon Apr 29 13:25:01 EDT 2002


Hi Scott,

	I tried your suggestion of turning smallList into an associative 
array with the index for each element equal to the text I'm looking for 
in bigList.  I think I must have misunderstood your suggestion because 
the handler runs much slower than previously, perhaps because I've got 
it asking for the keys of smallList for every line of bigList.  Here's 
what I tried.

-- Note: smallListArray is an array made out of the original
-- smallList variable

repeat for each line i in bigList
     if item 6 of i is in keys(smallListArray)
     then
          put i into hitList[item 6 of i]
     end if
end repeat
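
One reading of Scott's suggestion is to test the array element directly
instead of searching the list of keys on every pass through the loop.
What follows is only a minimal sketch, assuming smallListArray is keyed
by the text being matched and holds a non-empty value under every key.
The variable theKey is a name introduced here for illustration.

     set the itemDelimiter to tab
     repeat for each line i in bigList
          put item 6 of i into theKey
          -- Direct lookup of one element; keys() is never called
          if smallListArray[theKey] is not empty then
               put i into hitList[theKey]
          end if
     end repeat

Note that this tests for an exact key, whereas a search of the keys list
can also match a partial string, so the two may not return identical
hits.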


Message: 3
Subject: Re: Comparing big lists
Date: Sat, 27 Apr 2002 16:10:42 -0400
From: Gregory Lypny <gregory.lypny at videotron.ca>
To: "MetaCard List" <metacard at lists.runrev.com>
Reply-To: metacard at lists.runrev.com

Thanks for the suggestion, Scott.  I'll give it a shot.  I've also tried
looping over the lines of bigList (i.e., a nested repeat), simply using
the 'in' operator:  if x is in y, then...  It takes about 6 minutes on a
modest (300 MHz) iBook running OS X, but I'm hoping for an improvement.

      Regards,

           Greg

On 27/4/2002 12:08 PM, metacard-request at lists.runrev.com wrote:

Message: 2
Date: Fri, 26 Apr 2002 12:48:53 -0600 (MDT)
From: Scott Raney <raney at metacard.com>
To: metacard at lists.runrev.com
Subject: Re: Comparing big lists
Reply-To: metacard at lists.runrev.com

On: Thu, 25 Apr 2002 Gregory Lypny <gregory.lypny at videotron.ca> wrote:

   Thought I would pick your brains on the topic of comparing two big
lists.  Both are tab delimited.  bigList has about 100,000 lines and
6 items (columns) per line.  smallList is about 15,000 lines and 2
items per line.  I want to identify the lines in bigList in which
the third item is the same as the second item in a line in
smallList, and then pull out the intersection.  I used something
like this, which works fine.

     set the itemDelimiter to tab
     repeat for each line j of smallList
          put lineOffset(item 2 of j, bigList) into thisLine
          if thisLine is not 0 then put j & tab & \
               line thisLine of bigList & return after mergedList
     end repeat
     delete last character of mergedList  -- Get rid of the trailing Return

Using the lineOffset function seemed the obvious choice to me, but I'm
also interested in other approaches.

LineOffset on such a big variable is going to be pretty expensive.
Another option would be to use split to build an array out of smallList
and then loop over each line in bigList and see if there is an array
index for it.  Split takes a while and will use up a good bit of
memory, but makes the lookups *much* faster.  You could save some of
that space by building up an array of just the relevant items in one
list or the other by looping over the lines and creating one array
index for each.
  Regards,
    Scott
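
The last idea above, building an array of just the relevant items with a
loop, might look something like the sketch below.  It is untested
against the real data and rests on assumptions: the handler name
mergeLists and the variables smallArray and theKey are made up for
illustration, and the match is taken to be item 3 of a bigList line
against item 2 of a smallList line, as in the original question.

     function mergeLists smallList, bigList
          set the itemDelimiter to tab
          -- One array element per smallList line, keyed by its second item
          repeat for each line j in smallList
               put j into smallArray[item 2 of j]
          end repeat
          -- Single pass over bigList with a direct array lookup per line
          repeat for each line i in bigList
               put item 3 of i into theKey
               if smallArray[theKey] is not empty then
                    put smallArray[theKey] & tab & i & return after mergedList
               end if
          end repeat
          delete last character of mergedList  -- Trailing Return
          return mergedList
     end mergeLists

This does not behave exactly like the lineOffset version.  It compares
whole items rather than searching for the text anywhere in bigList, it
emits every matching bigList line rather than only the first, and if two
smallList lines share the same second item only the later one is kept.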

     Regards,
         Greg



