indexing slow-down (i.e., speeding up programs)

Pierre Sahores psahores at easynet.fr
Sat Apr 12 20:34:01 EDT 2003


John Vokey wrote:
> 
>    Here's my task:  I've got two LARGE text files, one a spelling
> dictionary (from Excalibur) and the other a dictionary of lexical
> frequencies (i.e., item and frequency); there are over 113,000 unique words
> (as lines) in the first, and about a million different entries (as
> separate lines) in the second, many not regular words (e.g., dates,
> numbers, etc.); furthermore, in the second, the same word is listed
> multiple times depending on use (e.g., ``rose'' as a noun and ``rose''
> as a verb).  I want to extract the lexical frequency (ignoring use
> type) of each of the words in the first from the second, including
> assigning a frequency of zero for those from the first not found in the
> second.
> 
>    Fortunately, both are already alphabetised, so as I move sequentially
> through the first, I can simply start searching from where I last left
> off in the second, summing over multiple use listings of the same word.
>   So far so good.   I use the ``repeat for each ... in ... '' construct
> for the first dictionary, and a variable pointer for the second that I
> advance as I find each word from the first (I call a simple function
> that skips over non-words and advances the pointer, returning the line
> of the next word in the lexical dictionary).  Both text files are read
> into memory at the start of the routine, which takes less than a second
> using the ``url file://...'' command (way to go, metacard!).
> 
>    Here's my problem: initially, the routine takes much less than 1
> second per hundred words (which, when multiplied by the number of words
> remaining to index, results in an estimate of some small fraction of an
> hour to do the whole task).  However, it rapidly (as an exponential
> function) slows down, so that by the time it reaches the middle of the
> alphabet (M), it takes many minutes per 100 words, and an
> ever-increasing time estimate for the remaining items (now over 60
> hours!).  Clearly, either the ``repeat for each'' command for the first
> dictionary  or the ``get line pointer...'' for the second (or both)
> get(s) slower and slower as I progress through the dictionaries,
> presumably because to do one or the other (or both), metacard counts
> carriage returns from the beginning of the dictionary.
> 
>    I've tried: deleting the items from the front of the lexical
> dictionary as I've searched them, so that each new search starts at
> line 1 of the edited list [i.e., put line pointer to (the number of
> lines of dict2) of dict2 into dict2], but that slows it down even more
> (presumably because metacard needs to count crs to find the number of
> the last line each time).  I've tried various machinations of
> offset(,,skip) using the skip variable as an absolute pointer, but it
> also appeared to make things slower presumably because it counts chars
> from the start of the dictionary.  I guess I could divide dict2 (or
> dict1) into many small segments, moving to each successive segment as
> the previous one was exhausted, but I was hoping for something more
> elegant.
> 
>    What I need are *absolute* pointers (preferably a memory address, or
> a pointer or a handle to such), rather than the relative (to the
> beginning of the list) pointers given by the line (and possibly ``for
> each'') construct.  Arrays presumably would work (but doesn't metacard
> then have to search the indices to resolve the array reference?), and
> reading the million-line file into the array just sets the problem
> back one step.
> 
>    Any suggestions would be appreciated.  As I receive the list in
> digest form, if you have a scathingly brilliant idea, please send a
> copy of it directly to my email address.  TIA
> --
> John R. Vokey, Ph.D.                       |\      _,,,---,,_
> Professor                                  /,`.-'`'    -.  ;-;;,_
> Department of Psychology and Neuroscience  |,4-  ) )-,_. ,\ (  `'-'
> University of Lethbridge                  '---''(_/--'  `-'\_)

Hi,

To get good speed when parsing this kind of data, you will need to
split each dictionary into a series of small variables (between 50 and
100 KB each is a good choice) and build array-based indexes to access
the two collections of split variables (B-tree node/key and
transposition-table algorithms give the best results). It is best to do
this once, at startup, in an "on openStack" handler. I used this
technique years ago under HyperCard 2.4.1 on a G3/266 and got very good
speed parsing multiple dictionaries of up to 6 MB each.
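As a minimal sketch of the idea (the file name "freqdict.txt", the
variable names, the 64 KB threshold, and the assumption that each line
of the frequency dictionary is a "word frequency" pair are all mine,
not necessarily what your files look like):

on openStack
  -- split the big frequency dictionary into ~64 KB chunks and build
  -- an array index of each chunk's last word, so a lookup only ever
  -- scans one small variable
  global gLexChunk, gLexLast, gLexCount
  put URL "file:freqdict.txt" into tDict -- hypothetical file name
  put 0 into gLexCount
  put empty into tChunk
  put empty into tPrevWord
  repeat for each line tLine in tDict
    -- flush only at a word boundary, so every use listing of the
    -- same word stays together in one chunk
    if the length of tChunk > 65536 and word 1 of tLine is not tPrevWord then
      add 1 to gLexCount
      put tChunk into gLexChunk[gLexCount]
      put tPrevWord into gLexLast[gLexCount]
      put empty into tChunk
    end if
    put tLine & return after tChunk
    put word 1 of tLine into tPrevWord
  end repeat
  if tChunk is not empty then -- flush the final partial chunk
    add 1 to gLexCount
    put tChunk into gLexChunk[gLexCount]
    put tPrevWord into gLexLast[gLexCount]
  end if
end openStack

function lexFrequency tWord
  -- binary search the index for the first chunk whose last word is
  -- >= tWord (the chunks are alphabetised), then sum the frequencies
  -- over every use listing inside that one small chunk
  global gLexChunk, gLexLast, gLexCount
  put 1 into tLo
  put gLexCount into tHi
  repeat while tLo < tHi
    put (tLo + tHi) div 2 into tMid
    if gLexLast[tMid] < tWord then
      put tMid + 1 into tLo
    else
      put tMid into tHi
    end if
  end repeat
  put 0 into tTotal -- stays 0 when the word is missing from dict2
  repeat for each line tLine in gLexChunk[tLo]
    if word 1 of tLine is tWord then
      add word 2 of tLine to tTotal -- assumes "word frequency" lines
    end if
  end repeat
  return tTotal
end lexFrequency

Flushing only at a word boundary keeps all the use listings of a word
inside a single chunk, so the summing never has to cross chunks, and
each "repeat for each" pass runs over a small variable instead of the
whole million-line dictionary.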
--
Regards, Pierre Sahores

Inspection académique de Seine-Saint-Denis.
Web and VPN applications and databases
Qualifying and producing competitive advantage


