to find the first numeric line?
Alex Tweedly
alex at tweedly.net
Fri Feb 1 19:41:01 EST 2013
I just can't resist a good benchmark / coding challenge.
Summary - for small data (10s of thousand of lines), they're all fairly
quick; for large data sets, there are big performance differences and
how far through the data set the "start-with-numeric" lines begin.
I used 10 million lines of data, with 'numeric' staring at either line
10,000 or 9,999,000
simple repeat loop - clear, easy to understand, relatively slow but
times vary widely
filter without ... - fairly clear, generally a little bit quicker,
consistent times
binary search using character offset - hard to understand, MUCH faster,
consistent times
specifically
repeat between 4 and 3809 ms
filter between 2047 and 3602
binary between 0 and 230
For the binary search, the coding is slightly tricky. You can avoid some
of the edge cases by making an assumption that there is a max length for
any line (I assumed 5,000 characters, see kBrute) - but if you can't
make any such assumption, then you can handle it - just need to handle a
couple of special cases. I've left that as an exercise :-)
Here's the code I used (the data is in global gData)
> global gData
> on mouseup
> local t1, t2, K
>
> put the millisecs into t1
> put f1(gData) into t2
> put t2 && the millisecs - t1 & CR after msg
>
> put the millisecs into t1
> put f2(gData) into t2
> put t2 && the millisecs - t1 & CR after msg
>
> put the millisecs into t1
> put f3(gData) into t2
> put t2 && the millisecs - t1 & CR after msg
>
> end mouseup
>
> function f1 @pD
> local r
> repeat for each line L in pD
> if char 1 of L is a number then exit repeat
> add 1 to r
> end repeat
> return r + 1
> end f1
>
> function f2 pD
> filter pD without "[0-9]*"
> return the number of lines in pD + 1
> end f2
>
> function f3 @pD
> local r, tLower, tUpper, tMid, c, t, temp
> constant kBrute = 10000
>
> put 1 into tLower
> put (the number of chars in pD) + 1 into tUpper
>
> repeat 10000 times -- more than enough :-)
> if tUpper - tLower < kBrute then exit repeat
> put (tUpper + tlower ) / 2 into tMid
> put char 1 of line 2 of (char tMid to tUpper of pD) into c
> if c is a number then
> put tMid into tUpper
> else
> put tMid into tLower
> end if
> end repeat
>
> put char tLower to tUpper of pD into temp
> filter temp without "[0-9]*"
> if temp is empty then
> -- no lines start with numeric
> return "No lines"
> else
> put the number of lines in (char 1 to tLower of pD) into t
> add the number of lines in temp to t
> return t
> end if
> end f3
-- Alex.
More information about the use-livecode
mailing list