Extracting a column

Jim Ault JimAultWins at yahoo.com
Mon Dec 10 12:59:52 EST 2007


On 12/10/07 8:24 AM, "Richard Gaskin" <ambassador at fourthworld.com> wrote:
> While the relative benchmarks favor "repeat for each", in absolute terms
> being able to extract a column from half a million lines per second
> isn't bad. :)
> 
> Here's the code - please let me know if I've missed something here which
> may be skewing the results:

The type of file you are parsing could be one of the determining factors.
Try generating a specific file structure, for example:

put empty into tData
put 4 into tCol  
--  run trials with tCol = 44, 444
put "A" into item tCol of tTemp
--  run trials with a sentence as item 1

repeat 100000
    put tTemp & cr after tData
end repeat

The number of columns, and the length of the content before the column you
are extracting, could be the biggest factor.  Extracting col 2 could be a lot
faster than extracting col 9, since each item lookup has to scan past the
items in front of it.  In most cases, if you know which column(s) you will
want to extract, you can arrange your file format to put these closest to the
first item.  If you inherit the data or don't have a choice... c'est la guerre.
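
The trials suggested above could be run with something like the following.
This is only a rough sketch: it assumes Richard's GetCol2 function (quoted
below) is in the same script, it uses tab-delimited lines to match his test
data, and the handler name testColumnPosition and its variable names are
made up for illustration.

```livecode
on testColumnPosition
   -- hypothetical handler: times GetCol2 (from the quoted script below)
   -- when the target column is 4, 44 and 444
   set the itemDel to tab
   put empty into tReport
   repeat for each word tCol in "4 44 444"
      -- one line with "A" in column tCol and empty columns before it
      put empty into tTemp
      put "A" into item tCol of tTemp
      -- for the "sentence as item 1" trial, also do something like:
      -- put "some longer text here" into item 1 of tTemp
      --
      -- build 100000 identical lines
      put empty into tData
      repeat 100000
         put tTemp & cr after tData
      end repeat
      --
      -- time a single extraction of column tCol
      put the millisecs into t
      put GetCol2(tData, tCol) into tResult
      put the millisecs - t into tElapsed
      put "col" && tCol & ":" && tElapsed && "ms" & cr after tReport
   end repeat
   -- show the timings in the message box
   put tReport
end testColumnPosition
```

Swapping GetCol1 in for GetCol2 should show whether "split" is affected by
the column position in the same way.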

Jim  Ault
Las Vegas


On 12/10/07 8:24 AM, "Richard Gaskin" <ambassador at fourthworld.com> wrote:

> Klaus Major wrote:
> 
>>> Does somebody know if there is a "quick" way to extract a column
>>> from a tab limited list (in a field or a variable)?
>>> By "quick" I mean I'm not obliged to cycle through all the lines of
>>> my var because it can be quite long.
>>> 
>>> I've tried to use array but the transpose function doesn't work if
>>> the number of columns is not the same as the number of lines.
>>> Or can I do something else with an array to achieve that goal?
>> 
>> use the new "split" command!
>> ...
>> put "your data here" into myvar
>> put 2 into my_column
>> ## The number of column you want to extract
>> split myvar by column
>> ## turns your data into an array!
>> put myvar[my_column] into my_column_data
>> ...
> 
> Well done, Klaus.  I'd forgotten that the "split" command has been
> extended with the "column" token, and since I have a data management
> library that I use in a number of apps I decided to test this against
> the "repeat for each line" method I'm currently using.
> 
> It seems that even with the convenience of the new form of "split", the
> "repeat for each line" method is still faster - here are the results of
> this morning's test:
> 
>    Split: 1101 ms (490.46 lines/ms)
>    Repeat: 499 ms (1082.16 lines/ms)
>    Same results?: true
> 
> (MacBook Pro 2.16GHz, OS X 10.4.11)
> 
> While the relative benchmarks favor "repeat for each", in absolute terms
> being able to extract a column from half a million lines per second
> isn't bad. :)
> 
> 
> Here's the code - please let me know if I've missed something here which
> may be skewing the results:
> 
> on mouseUp
>    set cursor to watch
>    --
>    -- Number of times to run the test:
>    put 1000 into n
>    --
>    -- "src" contains a tab-delimited list of 540 lines:
>    put fld  "src" into tData
>    --
>    -- TEST 1: split
>    put the millisecs into t
>    repeat n
>      put GetCol1(tData, 2) into tmp1
>    end repeat
>    put the millisecs - t into t1
>    --
>    -- TEST 2: repeat for each:
>    put the millisecs into t
>    repeat n
>      put GetCol2(tData, 2) into tmp2
>    end repeat
>    put the millisecs - t into t2
>    --
>    -- Display results:
>    put tmp1 into fld "r1"
>    put tmp2 into fld "r2"
>    --
>    -- Display times and verify that the
>    -- results are the same:
>    put N * the number of lines of tData into x
>    set the numberformat to "0.##"
>    put "Split: "&t1 &" ms ("& x/t1 &" lines/ms)"& \
>        cr& "Repeat: "&t2 &" ms ("& x/t2 &" lines/ms)"&\
>        cr&"Same results?: "&(tmp1 = tmp2)
> end mouseUp
> 
> --
> --  TEST 1: split
> --
> function GetCol1 pData, pCol
>    split pData by column
>    return pData[pCol]
> end GetCol1
> 
> --
> -- TEST 2: repeat for each
> --
> function GetCol2 pData, pCol
>    put empty into tVal
>    set the itemdel to tab
>    repeat for each line tLine in pData
>      put item pCol of tLine &cr after tVal
>    end repeat
>    delete last char of tVal
>    return tVal
> end GetCol2
> 
> 
> 
> My test stack with a 540-line source field is at:
> 
>     go url "http://fourthworldlabs.com/getcol_test.rev"
> 




