Comparing big lists

Dave Cragg dcragg at lacscentre.co.uk
Mon Apr 29 08:46:01 EDT 2002


At 6:48 pm -0700 28/4/02, erik hansen wrote:
>how about a short example?

Not so short. :)

Here's a "useless" example to illustrate the cost of incrementing 
chunk expressions in a long loop vs using "repeat for each". It 
simply copies a list, first using chunk expressions, next using 
"repeat for each" Vary the value of tListSize to see how the first 
method gets exponentially slower, whereas the second method's time 
grows proportionally.

on mouseUp
   #build a list of numbers
   put 1000 into tListSize
   repeat tListSize
     put random(1000000) & cr after tList
   end repeat
   delete char -1 of tList

   put empty into tOutList
   #test 1
   put the milliseconds into tStart
   put the number of lines of tList into tNumLines
   repeat with i = 1 to tNumLines
     put line i of tList & cr after tOutList
   end repeat
   delete char -1 of tOutlist
   put the milliseconds - tStart into tTime
   put "Test 1:" && tTime into tResult
   put empty into tOutList
   #test 2
   put the milliseconds into tStart
   repeat for each line tLine in tList
     put tLine & cr after tOutList
   end repeat
   delete char -1 of tOutlist
   put the milliseconds - tStart into tTime
   put cr & "Test 2:" && tTime after tResult
   answer tResult
end mouseUp

Here's an example that "kind of" deals with Greg's original question. 
This builds two lists of numbers, one of two columns and one of 6. It 
rigs the columns to be compared by prepending and appending an "x". 
This prevents lineOffset in the first method finding the value in the 
wrong column, and avoids the problem of substrings matching. The 
value in the second column of the small list is unique. (I take it 
from Greg's mail that his data was something like this, but in 
general, you'll need to do more work with both methods.)

The first method uses lineOffset in a loop. The second method builds 
an array from one list, and then loops through the second list line 
at a time checking if there is an array entry for the item being 
compared. The time difference is substantial.

This doesn't mean there isn't a place for using the offset functions. 
For example, if searching how many times a single word occurs in some 
text, it works fine, and there's no overhead of building an array. 
But when comparing two large lists,  it can get real slow.

on mouseUp
   #build two lists for testing
   put 10000 into tLargeListSize
   put 1000 into tSmallListSize
   repeat tLargeListSize
     put empty into tLargeLine
     repeat 6
       put random(1000000)  & comma after tLargeLine
     end repeat
     put "x" before item 3 of tLargeLine
     put "x" after item 3 of tLargeLine
     delete char -1 of tLargeLine
     put tLargeLine & cr after tLargeList
   end repeat
   delete char -1 of tLargeList

   repeat with i = 1 to tSmallListSize
     put empty into tSmallLine
     put random(10000) & comma & "x" & i & "x" after tSmallLine
     put tSmallLine & cr after tSmallList
   end repeat
   delete char -1 of tSmallList
   sort lines of tSmallList by item 1 of each ##random sort

   #start test
   #test 1 -- using lineOffset
   put the milliseconds into tStart
   ----------
   repeat for each line j in tLargeList
     put lineOffset(item 3 of j, tSmallList) into tOff
     if tOff is not 0 then
       put line tOff of tSmallList & ":" & j  & cr after tMergeList1
     end if
   end repeat
   delete char -1 of tMergeList1
   ------------
   put the milliseconds - tStart into tTime
   put "Test 1:" && tTime into tResult

   #test 2 -- using an array
   put the milliseconds into tStart
   -----------------
   #build array of item 2 of small list
   repeat for each line tLine in tSmallList
     put tLine into tArray[item 2 of tLine] ##to make tests comparable
   end repeat

   repeat for each line tLine in tLargeList
     if tArray[item 3 of tLine] <> empty then
       put tArray[item 3 of tLine] & ":" & tLine & cr after tMergeList2

     end if
   end repeat
   delete char -1 of tMergeList2
   ---------------------
   put the milliseconds - tStart into tTime
   put cr & "Test 2:" && tTime after tResult
   #put tMergeList1 into field 1
   #put tMergeList2 into field 2

   answer tResult
end mouseUp



Cheers
Dave Cragg



More information about the metacard mailing list