to find the first numeric line?

Alex Tweedly alex at tweedly.net
Fri Feb 1 19:41:01 EST 2013


I just can't resist a good benchmark / coding challenge.

Summary - for small data (10s of thousand of lines), they're all fairly 
quick; for large data sets, there are big performance differences and 
how far through the data set the "start-with-numeric" lines begin.

I used 10 million lines of data, with 'numeric' staring at either line 
10,000 or 9,999,000

simple repeat loop - clear, easy to understand, relatively slow but 
times vary widely
filter without ... - fairly clear, generally a little bit quicker, 
consistent times
binary search using character offset - hard to understand, MUCH faster, 
consistent times

specifically

repeat       between       4 and  3809 ms
filter          between 2047 and 3602
binary        between      0  and   230

For the binary search, the coding is slightly tricky. You can avoid some 
of the edge cases by making an assumption that there is a max length for 
any line (I assumed 5,000 characters, see kBrute) - but if you can't 
make any such assumption, then you can handle it - just need to handle a 
couple of special cases. I've left that as an exercise :-)

Here's the code I used   (the data is in global gData)

> global gData
> on mouseup
>    local t1, t2, K
>
>    put the millisecs into t1
>    put f1(gData) into t2
>    put t2 &&  the millisecs - t1 & CR  after msg
>
>    put the millisecs into t1
>    put f2(gData) into t2
>    put t2 &&  the millisecs - t1 & CR  after msg
>
>    put the millisecs into t1
>    put f3(gData) into t2
>    put t2 &&  the millisecs - t1 & CR  after msg
>
> end mouseup
>
> function f1 @pD
>    local r
>    repeat for each line L in pD
>       if char 1 of L is a number then exit repeat
>       add 1 to r
>    end repeat
>    return r + 1
> end f1
>
> function f2 pD
>    filter pD without "[0-9]*"
>    return the number of lines in pD + 1
> end f2
>
> function f3 @pD
>    local r, tLower, tUpper, tMid, c, t, temp
>    constant kBrute = 10000
>
>    put 1 into tLower
>    put (the number of chars in pD) + 1 into tUpper
>
>    repeat 10000 times -- more than enough :-)
>       if tUpper - tLower < kBrute then exit repeat
>       put (tUpper + tlower ) / 2 into tMid
>       put char 1 of line 2 of (char tMid to tUpper of pD) into c
>       if c is a number then
>          put tMid into tUpper
>       else
>          put tMid into tLower
>       end if
>    end repeat
>
>    put char tLower to tUpper of pD into temp
>    filter temp without "[0-9]*"
>    if temp is empty then
>       -- no lines start with numeric
>       return "No lines"
>    else
>       put the number of lines in (char  1 to tLower of pD) into t
>       add the number of lines in temp to t
>       return t
>    end if
> end f3

-- Alex.







More information about the use-livecode mailing list