to find the first numeric line?

Alex Tweedly alex at tweedly.net
Sun Feb 3 15:30:00 EST 2013


On 02/02/2013 18:56, J. Landman Gay wrote:
> On 2/2/13 3:08 AM, Alex Tweedly wrote:
>> There's no need to decrease kBrute to deal with
>> short lines, because we still do a full search (in this case using
>> filter) when we get down to a block of size kBrute.
>
> Aha. The filter is the part I missed. I get it now. Clever.
>
Hmmm - kinda "clever", as in "cute clever".
But also kinda dumb, as in "why abandon an efficient method for an 
inefficient one, just to save a little bit of thinking" :-)

So I went back, did the thinking, eliminated kBrute (and thankfully the 
need for an unnecessary assumption/requirement to limit line length).  I 
also changed it so that when we are about to test, we always have a 
complete line identified - and so it is much easier to adapt the code 
for other purposes.

And - I changed the problem definition from
     "Return the line number of the first line with a leading numeric"
to
     "return the character number of the first character of the first 
line with a leading numeric"

This is much more efficient to use outside the function, as well as more 
efficient inside the function. The caller can now do something like

     put getCharacterPosition(myData) into tPos
     repeat for each line L in (char tPos to -1 of myData)
        ...

instead of
    put getLinePosition(myData) into tLPos
    repeat for each line L in (line tLPos to -1 of myData)
        ...

or
    put line 1 of (char tPos to -1 of myData) into temp
rather than
   put line tLPos of Mydata into temp

Both of these are much more efficient (will save 10s or 100s of 
millisecs in large data sets).

With these changes, the binary search can now do 10M lines of input in 
typically <1 msec
Even the worst case I can think of (all 300 Million characters are in a 
single line, no CRs anywhere), it gets a result in 1800 msec.

So - here's the code for anyone who needs an efficient binary search of 
large string data.
-- Alex.

> function f3 @pD
>    local theResult
>    local tLower, tUpper, tMid, c, t
>
>    -- NB we are always going to maintain theResult as a feasible answer
>    -- if there is no 'leading numeric' line, we answer with a number 
> beyond the input data
>    put (the number of chars in pD) + 1 into theResult
>
>    put 1 into tLower
>    put theResult into tUpper
>    put trunc( (tUpper + tlower ) / 2) into tMid
>
>    repeat 10000 times
>       -- check for a CR in the upper half
>       put the offset(CR, char tMid to tUpper of pD) into t
>       if t = 0 or tMid + t = tUpper then
>          -- there is no CR in upper half
>          -- move tUpper down to near tMid.
>          -- Can't just move it down to exactly tMid because of the 
> corner case
>          -- where tMid to tUpper exactly spans the last line
>          put tMid+1 into tUpper
>       else
>          -- there is a CR in this upper half, move tMid to just beyond it
>          add t to tMid
>          -- now tMid points to the start of a line
>          -- do the test here - easy to adapt
>          put char tMid of pD into c
>          if c is a number then
>             put tMid into tUpper
>             put tMid into theResult
>          else   --
>             put tMid into tLower
>          end if
>       end if
>
>       put trunc( (tUpper + tLower ) / 2) into tMid
>       if tMid = tLower then
>          -- we're done, just decide if this single remaining line is a 
> candidate for the answer
>          if char tMid of pD is a number then  -- do the test again !!
>             put tMid into theResult
>          end if
>          exit repeat
>       end if
>
>       if tMid = tLower+1 then
>          -- tLower must point to the first char of an empty lie, so we 
> can increment over it
>          put tMid into tLower
>       end if
>    end repeat
>
>    return theResult
>
> end f3





More information about the use-livecode mailing list