Comparing big lists

Scott Raney raney at metacard.com
Fri Apr 26 14:52:01 EDT 2002


On: Thu, 25 Apr 2002 Gregory Lypny <gregory.lypny at videotron.ca> wrote:

>   Thought I would pick your brains on the topic of comparing two big
> lists.  Both are tab delimited.  bigList has about 100,000 lines and
> 6 items (columns) per line.  smallList is about 15,000 lines and 2
> items per line.  I want to identify the lines in bigList in which
> the third item is the same as the second item in a line in
> smallList, and then pull out the intersection.  I used something
> like this, which works fine.

>     set the itemDelimiter to tab
>     repeat for each line j of smallList
>          put lineOffset(item 2 of j, bigList) into thisLine
>          if thisLine is not 0 then put j & tab & \
>               line thisLine of bigList & return after mergedList
>     end repeat
>     delete last character of mergedList  -- Get rid of the trailing Return

> Using the lineOffset function seemed the obvious choice to me, but I'm
> also interested in other approaches.

Calling lineOffset on such a big variable is going to be pretty
expensive: each call scans bigList from the top, and you make one call
per line of smallList.  Another option would be to use split to build
an array out of smallList and then loop over each line in bigList and
see if there is an array index for it.  Split takes a while and will
use up a good bit of memory, but makes the lookups *much* faster.  You
could save some of that space by building up an array of just the
relevant items in one list or the other by looping over the lines and
creating one array index for each.
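
Here is a rough, untested sketch of the loop-built array version.  The
names smallArray and theKey are made up for illustration, and it
assumes the key fields (item 2 of smallList, item 3 of bigList) are
never empty and that each key appears only once in smallList (a
repeated key would keep only its last line):

    set the itemDelimiter to tab
    -- Build an array with one index per smallList line, keyed by item 2
    repeat for each line s of smallList
         put s into smallArray[item 2 of s]
    end repeat
    -- One pass over bigList: look up item 3 of each line in the array
    repeat for each line b of bigList
         put item 3 of b into theKey
         if smallArray[theKey] is not empty then
              put smallArray[theKey] & tab & b & return after mergedList
         end if
    end repeat
    delete last character of mergedList  -- Get rid of the trailing Return

Note that this is not an exact drop-in for your script: it compares
only item 3 of each bigList line, where lineOffset would match the
string anywhere in a line, and it picks up every matching bigList line
(in bigList order) rather than just the first one found.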
  Regards,
    Scott

>     Regards,
>         Greg

********************************************************
Scott Raney  raney at metacard.com  http://www.metacard.com
MetaCard: You know, there's an easier way to do that...



