shilling for my feature request [1926]

Alex Tweedly alex at tweedly.net
Sun Aug 22 22:25:20 EDT 2004


At 17:49 22/08/2004 -0700, Mark Brownell wrote:


>On Sunday, August 22, 2004, at 04:27 PM, Alex Tweedly wrote:
>
>>At 15:46 22/08/2004 -0700, Mark Brownell wrote:
>>
>>
>>>So let's say that you have hundreds of points either in an array or 
>>>comma delimited. What would be the fastest way to get the value of the 
>>>item just less than 100 in this following example?
>>>
>>>1, 34, 67, 99, 109, 121, 133, 144, 155, (answer = item 4) -- 99
>>>
>>>Would I have to loop through each one checking each during a repeat for 
>>>the last item less-than 100 or is there a faster way to get the value in 
>>>item 4. Every time I can drop a repeat loop seems to be a way to 
>>>optimize the result based on speed issues.
>>
>>If you have only have hundreds of points, then a loop may not be too bad.
>
>It was just an example to show getting item 4. In some cases I have 
>hundreds out of millions.

So a binary or linear interpolation search would be faster that a simple 
repeat loop.
Just do it !  :-)

<snip>

>Now let's say that I had thousands of message objects in a single document 
>and I wanted to use this to get the numerical points for each message 
>objects start and end points.
>
>put the offsets(myString, "<message>" into zap1
>put the offsets(myString, "</message>" into zap2
>
>now if I use this same function for:
>put the offsets(myString, "<soWhat>" into soWhatList
>
>Let's say that I get ten hits for "<soWhat>" and they occur at several
>numerical locations that turn out to be in message objects 20, 50, 60, 77, 
>200, etc...
>What I do now is parse all 1000 message objects and check each one for the 
>existence of the "<soWhat>" element.
>
>What I would like to do is just go to the ten message objects numerical 
>start and end points and parse them only. I do this by checking to see if 
>a "<soWhat>" number falls between a message objects' start and end points. 
>If that occurs then I would add that message to my smaller list of message 
>objects.
>
>So this: 40, 50, 60, 70 would be used on this:
>
>5, 15, 25, 35, 45, 55, 65, 75, 85, 95, to select items 4,5,6, and 7
>
>the first item 40 happens just after 35 in the larger list at the forth item
>the second item 50 happens just after 45 in the larger list at the fifth item
>the third item 60 happens just after 55 in the larger list at the sixth item
>the forth item 70 happens just after 65 in the larger list at the seventh item
>
>If I know the item number I'm interested in then I can use the value found 
>at that item to get that object. This would be a much faster process than 
>looping through every object to see if it has the secondary object of interest.

The fine details will depend on what (implicit) knowledge you have about 
the data at this point in the algorithm (such as - are the message start 
and message ends already known to match, can they be nested, are there gaps 
between one end and the next start, can soWhat's appear in those gaps, etc.)

But allowing for some fine tuning to adjust for that knowledge, I see no 
reason why you can't do something along the lines of  (excuse the C-like 
notation :-)

  put 1 into nextSoWhat
  put 1 into baseStart
  put 1 into baseEnd
  repeat forever
      if nextSoWhat > the number of items in soWhat then exit repeat
      put item nextSoWhat of soWhat into tLook
      -- find the start of this message
      put binarySearch(zap1, baseStart, len(zap1), tLook) into thisStart
      put thisStart into baseStart
      -- find matching end
      put binarySearch(zap2, baseEnd, len(sap2), tLook) into thisEnd
      put thisEnd into baseEnd
      -- add this message to the list to be processed
      put thisStart after theStartList
      -- now check if there are multiple soWhat's in this message
     repeat forever
        put nextSoWhat+1 into nextSoWhat
        if nextSoWhat > len(soWhat) then exit repeat
        if item nextSoWhat of soWhat > item thisEnd of zap2 then exit repeat
     end repat
  end repeat

Note that if the data is known to have matched start - end messages, you 
don't need the second binary search - you can simply use the same index 
into zap2.

This does K binary searches (or K*2) when there are K messages in the 
result, plus it scans each soWhat within the same message, and the binary 
searches are increasingly fast, because after the first one you don't need 
to start at 1, you start the binary search at the last one found.

btw - if the number of soWhats was much larger than the number of messages 
(which it could be in the abstract case), then instead of scanning the 
soWhat list, you'd turn around and binarySearch(soWhat, nextSoWhat, 
len(soWhat), item thisEnd of zap2)  to very quickly pass over multiple 
soWhat's in the same message without scanning each one.

-- Alex.


More information about the use-livecode mailing list