How to filter a big list

Alex Tweedly alex at tweedly.net
Thu Oct 22 18:33:20 EDT 2009


Richard Gaskin wrote:
> Jérôme Rosat wrote:
>> I explained in my message that I wish to filter a list of names and  
>> addresses dynamically when I type a name in a field. This list  
>> contains 400'000 lines like this:  Mme [TAB] DOS SANTOS albertina  
>> [TAB] rue GOURGAS 23BIS [TAB] 1205 Genève
>>
>> I made various tests using the "repeat for each" loop and the  
>> "filter ... with" command. Filtering takes the most time when I type  
>> the first and the second letter. That takes approximately 800  
>> milliseconds for the first char and about 570 milliseconds for the  
>> second char. The repeat loop with the "contains" operator is a 
>> little  beat slower (about 50 milliseconds) than the "filter ... 
>> with". There  is no significant difference when the third char or 
>> more is typed. Of  course I filter a variable before to put it in the 
>> list field.
>>
>> Obviously, 800 milliseconds to filter a list of 400'000 lines, it is  
>> fast. But it is too slow for what I want to do. It would take a time  
>> of filtering lower than 300 milliseconds so that the user is not  
>> slowed down in his typing.
>
> Would it be practical to break your list into 26 sublists by first 
> letter?
That's a pragmatic approach - but I think it's the wrong one.

The fundamental problem is that the idea of scanning an entire list at 
keystroke speed is not robust. Even if splitting into multiple lists 
works for now, there's no guarantee that it will work tomorrow - when 
the database doubles in size, or the data becomes skewed because it 
contains too many people with the same first letter, or .... or the 
users demand a similar feature for address as well as surname, or they 
want to match string anywhere within the name, or ....

What you ought to do (imnsho) is to change the algorithm to one which is 
inherently responsive, using either 'send' or 'wait-with-messages' to 
ensure that this matching process does not interfere with 
responsiveness. In this case, I think it's easier to use wait-with-messages.

So in outline

each time the match data changes, you restart the matching process

the matching process checks a fixed, and relatively small, number of 
possible matches
      updates the field showing the user what matches have been found
      and then allows other things to happen before continuing with the 
matching.

I'd have a single handler that is always called when any changes happens 
to the user input, which can kick off a new matching process (by sending 
to the match handler). Then within the handler, I'd periodically check 
whether there is a pending message to restart a new handler.

So a brief version of the whole script would be

> local sShow, sStart, sData, sFound,sMatch
> global gData
>
> on keyUp
>    matchStringHasChanged
>    pass keyUp
> end keyUp
>
> on matchStringHasChanged
>    send "processamatch" to me in 0 millisecs
> end matchStringHasChanged
>
> on processamatch
>    local tCount
>    
>    put gData into sData
>    put the text of field "Field" into sMatch
>    
>    put empty into field "Show"
>    put empty into sShow
>    
>    repeat for each line L in sData
>       add 1 to tCount
>       if L begins with sMatch then
>          put L &CR after sShow
>       end if
>       if tCount mod 100 = 0 then
>          put sShow & "....." & CR into field "Show"
>          wait 0 millisecs with messages
>          if the pendingmessages contains ",processamatch," then
>             put "exiting" & CR after field "StatusLog"
>             exit processamatch
>          end if
>       end if
>    end repeat
>    put sShow into field "Show"
>    put "Completed" && the number of lines in sShow &CR after field 
> "StatusLog"
> end processamatch
>


Note the use of "......" to give an indication that work is still in 
progress and there may be more matches to come.

You could easily add refinements to this

1a.  if a matching process has completed (rather than exited), and if 
previous match string was a substring of the new matchstring, then 
instead of starting with
          put gData into sData
you could instead do
          put sShow into sData 

(i.e. re-use the filtered list - but be sure to remember that if you 
exit before completing, or if the matchstring changes in any other way 
you need to restart with the whole of gData)

1b. If you do 1a, then if you are *nearly* complete with a match when 
the matchstring changes, then just go ahead and complete it, so you get 
to work on the subset.
(good luck deciding what *nearly* means :-)

btw - I don't think there is any magic 'split'-based method possible here.


-- Alex.



More information about the use-livecode mailing list