Extracting a column

Richard Gaskin ambassador at fourthworld.com
Mon Dec 10 14:39:34 EST 2007

Jim Ault wrote:
> --number of cols, and the length of the content before the column for
> extracting could be the biggest factor.  Col 2 extract could be a lot faster
> than col 9.  In most cases, knowing which column(s) you wish to extract will
> mean you adjust your file format to put these closest to the first item.  If
> you inherit the data or don't have a choice... c'est la guerre.

Excellent thoughts.

I modified the test to generate data rather than using the canned data 
supplied, adding this near the top of the test handler:

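   put "4" into t
   -- s was never initialized, so it started out as the literal "s";
   -- setting it explicitly keeps the same data with explicit vars on:
   put "s" into s
   -- Make cols (each one char longer than the last):
   put empty into tRow
   repeat with i = 1 to 500
     put s & t into s
     put s & tab after tRow
   end repeat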
   -- make rows:
   put empty into tData
   repeat with i = 1 to 500
     put tRow &cr after tData
   end repeat
   delete last char of tData
   -- Verify sizes:
   set the itemdel to tab
   answer "Cols: "&the number of items of line 1 of tData &\
       cr&"Rows: "& the number of lines of tData &\
       cr&"Size: "&len(tData)

This gave me a data set of 500 cols with 500 rows, with each column 
containing one more character than the last, the longest being 501 
chars, with a total size of 63,125,499 chars.  I left the functions 
themselves unchanged.
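
As a quick check on that size figure, the arithmetic works out as 
follows, given column widths of 2 through 501 chars, one tab after 
every column, and a return after every row but the last:

   Per row:   (2 + 3 + ... + 501) chars + 500 tabs
            = 125,750 + 500
            = 126,250
   All rows:  500 x (126,250 + 1 return) - 1 (last return deleted)
            = 63,125,500 - 1
            = 63,125,499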

Having it get column 490 gave me these results:

   Split: 32110 ms (0.16 lines/ms)
   Repeat: 3946 ms (1.27 lines/ms)
   Same results?: true

Getting column 2 gave me:

   Split: 39192 ms (0.13 lines/ms)
   Repeat: 2495 ms (2 lines/ms)
   Same results?: true

So then I tried a very horizontal data set of just 20 rows but with 2000 
columns in each, for a total size of 40,100,019 chars.

Grabbing column 1999 from this data set gave me:

   Split:  7849 ms (0.03 lines/ms)
   Repeat: 2328 ms (0.09 lines/ms)
   Same results?: true

So I think what we're seeing is that the overhead of parsing applies to 
both methods.

On the one hand, the "split" command degrades more gracefully as the 
data gets more horizontal when accessing items near the end of each row: 
it was about 8 times slower than "repeat for each" on the 500-column 
set, but only about 3.4 times slower on the 2000-column set.  On the 
other hand, its performance stays roughly the same no matter which item 
is obtained, while "repeat for each" improves for items closer to the 
left.  And in all cases tested, "repeat for each" still beats "split" 
in overall performance.
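
For readers without the earlier messages, the two approaches have 
roughly the following shape.  This is a minimal sketch only, not the 
functions from the test: the handler names colByRepeat and colBySplit 
are hypothetical, and the split-based version is an assumption about 
how such a function might be written.

   -- Hypothetical sketch: "repeat for each" with item references
   function colByRepeat pData, pCol
      set the itemDelimiter to tab
      put empty into tResult
      repeat for each line tLine in pData
         put item pCol of tLine & cr after tResult
      end repeat
      delete last char of tResult
      return tResult
   end colByRepeat

   -- Hypothetical sketch: "split" each row into an array first
   function colBySplit pData, pCol
      put the number of lines of pData into tCount
      split pData by cr  -- pData is now an array keyed 1..tCount
      put empty into tResult
      repeat with i = 1 to tCount
         put pData[i] into tRow
         split tRow by tab  -- every column of the row is parsed
         put tRow[pCol] & cr after tResult
      end repeat
      delete last char of tResult
      return tResult
   end colBySplit

In the first sketch, "item pCol of tLine" only has to scan as far as 
the requested column, which would be consistent with early columns 
coming back faster.  In the second, each row is split in full before a 
single element is read, so the cost would be about the same whichever 
column is wanted.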

I imagine we could come up with a data set for which "split" outperforms 
"repeat for each", but my own data sets are well under 20 MB each (more 
commonly under 5 MB), almost never exceed 150 columns, and very rarely 
have more than 1k in any one column of a given row, so these tests cover 
most real-world scenarios for my needs.

Just the same, if someone comes up with a real-world scenario in which 
"split" outperforms "repeat for each" I'd be very interested in learning 
what that data looks like and how it's used.

  Richard Gaskin
  Managing Editor, revJournal
  Rev tips, tutorials and more: http://www.revJournal.com
