Extracting a column
Richard Gaskin
ambassador at fourthworld.com
Mon Dec 10 14:39:34 EST 2007
Jim Ault wrote:
> --number of cols, and the length of the content before the column for
> extracting could be the biggest factor. Col 2 extract could be a lot faster
> than col 9. In most cases, knowing which column(s) you wish to extract will
> mean you adjust your file format to put these closest to the first item. If
> you inherit the data or don't have a choice... c'est la guerre.
Excellent thoughts.
I modified the test to generate data rather than using the canned data
supplied, adding this near the top of the test handler:
  put "4" into t
  -- Explicit seed (an uninitialized s would evaluate to its own name, "s"):
  put "s" into s
  -- Make cols:
  put empty into tRow
  repeat with i = 1 to 500
    put s & t into s
    put s & tab after tRow
  end repeat
  -- Make rows:
  put empty into tData
  repeat with i = 1 to 500
    put tRow & cr after tData
  end repeat
  delete last char of tData
  -- Verify sizes:
  set the itemdel to tab
  answer "Cols: " & the number of items of line 1 of tData & \
      cr & "Rows: " & the number of lines of tData & \
      cr & "Size: " & len(tData)
This gave me a data set of 500 cols with 500 rows, with each column
containing one more character than the last, the longest being 501
chars, with a total size of 63,125,499 chars. I left the functions
themselves unchanged.
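
For anyone who wants to check the reported size without building the data, here is a minimal sketch of the arithmetic. The function name expectedSize and its parameters are hypothetical and not part of the test stack; it only assumes what is described above: column i holds i + 1 chars, every column is followed by a tab, every row by a cr, and the final cr is deleted.

```livecode
-- Hypothetical helper, not from the original test stack.
function expectedSize pCols, pRows
   -- chars in one row: the sum of (i + 2) for i = 1 to pCols,
   -- i.e. (i + 1) chars of content plus one tab per column
   put pCols * (pCols + 1) / 2 + 2 * pCols into tRowLen
   -- every row is followed by a cr; the last cr is deleted
   return pRows * (tRowLen + 1) - 1
end expectedSize
```

Calling expectedSize(500, 500) works out to 126,251 chars per row times 500 rows, minus 1, which is the 63,125,499 reported by the answer dialog.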
Having it get column 490 gave me these results:
Split: 32110 ms (0.16 lines/ms)
Repeat: 3946 ms (1.27 lines/ms)
Same results?: true
Getting column 2 gave me:
Split: 39192 ms (0.13 lines/ms)
Repeat: 2495 ms (2 lines/ms)
Same results?: true
So then I tried a very horizontal data set of just 20 rows but with 2000
columns in each, for a total size of 40,100,019 chars.
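
The post does not show how the second data set was built; presumably only the two loop bounds changed. A minimal sketch under that assumption, with the generator wrapped in a hypothetical function (makeTestData, pCols and pRows are illustrative names, not from the original stack):

```livecode
-- Hypothetical wrapper around the generator, with the sizes as parameters.
function makeTestData pCols, pRows
   -- one-char seed, so column 1 holds 2 chars and column i holds i + 1
   put "s" into tCell
   put empty into tRow
   repeat with i = 1 to pCols
      put "4" after tCell
      put tCell & tab after tRow
   end repeat
   put empty into tData
   repeat with i = 1 to pRows
      put tRow & cr after tData
   end repeat
   -- drop the trailing cr so the line count matches pRows
   delete last char of tData
   return tData
end makeTestData
```

With this, the wide data set would be produced by `put makeTestData(2000, 20) into tData`, and the first one by `makeTestData(500, 500)`.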
Grabbing column 1999 from this data set gave me:
Split: 7849 ms (0.03 lines/ms)
Repeat: 2328 ms (0.09 lines/ms)
Same results?: true
So I think what we're seeing is that the overhead of parsing applies to
both methods.
On the one hand, the "split" command ramps more gracefully the more
horizontal the data gets when accessing items at the end of each row,
but on the other hand its performance remains roughly the same no matter
which item is obtained while "repeat for each" shows improvement with
items closer to the left. And in all cases tested, "repeat for each"
continues to best "split" in overall performance.
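
The two functions being compared are not reproduced in this message, so the following is only a sketch of what a "repeat for each" extractor and a "split" extractor might look like, plus a simple timer. All handler and variable names here are hypothetical, and the split variant in particular is an assumption about the approach, not the code that produced the numbers above.

```livecode
-- Hypothetical: walk the lines once, pulling one tab-delimited item from each.
function getColRepeat pData, pCol
   set the itemDelimiter to tab
   put empty into tResult
   repeat for each line tLine in pData
      put item pCol of tLine & cr after tResult
   end repeat
   delete last char of tResult
   return tResult
end getColRepeat

-- Hypothetical: split the data into an array of rows keyed 1..n,
-- then pull the item from each element in numeric order.
function getColSplit pData, pCol
   put the number of lines of pData into tCount
   split pData by cr
   set the itemDelimiter to tab
   put empty into tResult
   repeat with i = 1 to tCount
      put item pCol of pData[i] & cr after tResult
   end repeat
   delete last char of tResult
   return tResult
end getColSplit

-- Hypothetical timer: one pass of each method, then a comparison.
on compareMethods pData, pCol
   put the milliseconds into tStart
   put getColSplit(pData, pCol) into tSplit
   put the milliseconds - tStart into tSplitMs
   put the milliseconds into tStart
   put getColRepeat(pData, pCol) into tRepeat
   put the milliseconds - tStart into tRepeatMs
   answer "Split: " & tSplitMs && "ms" & cr & \
         "Repeat: " & tRepeatMs && "ms" & cr & \
         "Same results?: " & (tSplit is tRepeat)
end compareMethods
```

In both variants each row still has to be scanned from its first char up to the requested item, which fits the observation that parsing overhead applies to both methods; the split variant additionally pays for building the array before any item is read.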
I imagine we could come up with a data set for which "split" outperforms
"repeat for each", but my data sets are well under 20 MB each (more
commonly under 5 MB), almost never exceed 150 columns, and very rarely
hold more than 1k in any one column of a given row, so these tests
cover most real-world scenarios for my needs.
Just the same, if someone comes up with a real-world scenario in which
"split" outperforms "repeat for each" I'd be very interested in learning
what that data looks like and how it's used.
--
Richard Gaskin
Managing Editor, revJournal
_______________________________________________________
Rev tips, tutorials and more: http://www.revJournal.com