How to filter a big list
Alex Tweedly
alex at tweedly.net
Thu Oct 22 18:33:20 EDT 2009
Richard Gaskin wrote:
> Jérôme Rosat wrote:
>> I explained in my message that I wish to filter a list of names and
>> addresses dynamically when I type a name in a field. This list
>> contains 400'000 lines like this: Mme [TAB] DOS SANTOS albertina
>> [TAB] rue GOURGAS 23BIS [TAB] 1205 Genève
>>
>> I made various tests using the "repeat for each" loop and the
>> "filter ... with" command. Filtering takes the most time when I type
>> the first and the second letter. That takes approximately 800
>> milliseconds for the first char and about 570 milliseconds for the
>> second char. The repeat loop with the "contains" operator is a
>> little beat slower (about 50 milliseconds) than the "filter ...
>> with". There is no significant difference when the third char or
>> more is typed. Of course I filter a variable before to put it in the
>> list field.
>>
>> Obviously, 800 milliseconds to filter a list of 400'000 lines, it is
>> fast. But it is too slow for what I want to do. It would take a time
>> of filtering lower than 300 milliseconds so that the user is not
>> slowed down in his typing.
>
> Would it be practical to break your list into 26 sublists by first
> letter?
That's a pragmatic approach - but I think it's the wrong one.
The fundamental problem is that the idea of scanning an entire list at
keystroke speed is not robust. Even if splitting into multiple lists
works for now, there's no guarantee that it will work tomorrow - when
the database doubles in size, or the data becomes skewed because it
contains too many people with the same first letter, or .... or the
users demand a similar feature for address as well as surname, or they
want to match string anywhere within the name, or ....
What you ought to do (imnsho) is to change the algorithm to one which is
inherently responsive, using either 'send' or 'wait-with-messages' to
ensure that this matching process does not interfere with
responsiveness. In this case, I think it's easier to use wait-with-messages.
So in outline
each time the match data changes, you restart the matching process
the matching process checks a fixed, and relatively small, number of
possible matches
updates the field showing the user what matches have been found
and then allows other things to happen before continuing with the
matching.
I'd have a single handler that is always called when any changes happens
to the user input, which can kick off a new matching process (by sending
to the match handler). Then within the handler, I'd periodically check
whether there is a pending message to restart a new handler.
So a brief version of the whole script would be
> local sShow, sStart, sData, sFound,sMatch
> global gData
>
> on keyUp
> matchStringHasChanged
> pass keyUp
> end keyUp
>
> on matchStringHasChanged
> send "processamatch" to me in 0 millisecs
> end matchStringHasChanged
>
> on processamatch
> local tCount
>
> put gData into sData
> put the text of field "Field" into sMatch
>
> put empty into field "Show"
> put empty into sShow
>
> repeat for each line L in sData
> add 1 to tCount
> if L begins with sMatch then
> put L &CR after sShow
> end if
> if tCount mod 100 = 0 then
> put sShow & "....." & CR into field "Show"
> wait 0 millisecs with messages
> if the pendingmessages contains ",processamatch," then
> put "exiting" & CR after field "StatusLog"
> exit processamatch
> end if
> end if
> end repeat
> put sShow into field "Show"
> put "Completed" && the number of lines in sShow &CR after field
> "StatusLog"
> end processamatch
>
Note the use of "......" to give an indication that work is still in
progress and there may be more matches to come.
You could easily add refinements to this
1a. if a matching process has completed (rather than exited), and if
previous match string was a substring of the new matchstring, then
instead of starting with
put gData into sData
you could instead do
put sShow into sData
(i.e. re-use the filtered list - but be sure to remember that if you
exit before completing, or if the matchstring changes in any other way
you need to restart with the whole of gData)
1b. If you do 1a, then if you are *nearly* complete with a match when
the matchstring changes, then just go ahead and complete it, so you get
to work on the subset.
(good luck deciding what *nearly* means :-)
btw - I don't think there is any magic 'split'-based method possible here.
-- Alex.
More information about the use-livecode
mailing list