indexing slow-down (i.e., speeding up programs)

John Vokey vokey at uleth.ca
Sat Apr 12 17:16:00 EDT 2003


   Here's my task:  I've got two LARGE text files, one a spelling 
dictionary (from Excalibur) and the other a dictionary of lexical 
frequencies (i.e., item and frequency); there are over 113,000 unique 
words (as lines) in the first, and about a million different entries 
(as separate lines) in the second, many of them not regular words 
(e.g., dates, numbers, etc.); furthermore, in the second, the same word 
is listed multiple times depending on use (e.g., ``rose'' as a noun and 
``rose'' as a verb).  I want to extract from the second the lexical 
frequency (ignoring use type) of each of the words in the first, 
assigning a frequency of zero to any word from the first that is not 
found in the second.

   Fortunately, both are already alphabetised, so as I move sequentially 
through the first, I can simply start searching from where I last left 
off in the second, summing over multiple use listings of the same word. 
  So far so good.  I use the ``repeat for each ... in ...'' construct 
for the first dictionary, and a variable pointer for the second that I 
advance as I find each word from the first (I call a simple function 
that skips over non-words and advances the pointer, returning the line 
of the next word in the lexical dictionary).  Both text files are read 
into memory at the start of the routine, which takes less than a second 
using the ``url file://...'' command (way to go, metacard!).
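
   In rough pseudo-script, the whole thing looks something like the 
following (the variable names and the word/frequency layout are only 
illustrative, and the helper function that skips non-word entries is 
inlined here for brevity):

  put url "file:dict1.txt" into tDict1  -- ~113,000 words, one per line
  put url "file:dict2.txt" into tDict2  -- ~1,000,000 ``word frequency'' lines
  put 1 into tPointer
  repeat for each line tWord in tDict1
    put 0 into tFreq
    put line tPointer of tDict2 into tEntry    -- slow: counted from line 1
    repeat while tEntry is not empty and word 1 of tEntry <= tWord
      if word 1 of tEntry is tWord then add word 2 of tEntry to tFreq
      add 1 to tPointer
      put line tPointer of tDict2 into tEntry  -- slow again, and worse each time
    end repeat
    put tWord & tab & tFreq & return after tResults
  end repeat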

   Here's my problem: initially, the routine takes much less than 1 
second per hundred words (which, when multiplied by the number of words 
remaining to index, results in an estimate of some small fraction of an 
hour to do the whole task).  However, it slows down rapidly (apparently 
exponentially), so that by the time it reaches the middle of the 
alphabet (M), it takes many minutes per 100 words, and the time 
estimate for the remaining items keeps growing (it's now over 60 
hours!).  Clearly, either the ``repeat for each'' command for the first 
dictionary or the ``get line pointer...'' for the second (or both) 
gets slower and slower as I progress through the dictionaries, 
presumably because to do one or the other (or both), metacard counts 
carriage returns from the beginning of the dictionary.
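
   (The per-100-words figures above come from a crude timer along these 
lines; just a sketch, with the dict2 lookup elided:)

  put the milliseconds into tStart
  put 0 into tCount
  repeat for each line tWord in tDict1
    -- ... the dict2 lookup goes here ...
    add 1 to tCount
    if tCount mod 100 = 0 then
      put tCount && "words:" && (the milliseconds - tStart) && "ms for the last 100"
      put the milliseconds into tStart
    end if
  end repeat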

   I've tried: deleting the items from the front of the lexical 
dictionary as I've searched them, so that each new search starts at 
line 1 of the edited list [i.e., put line pointer to (the number of 
lines of dict2) of dict2 into dict2], but that slows it down even more 
(presumably because metacard needs to count carriage returns to find 
the number of the last line each time).  I've tried various machinations 
of offset(,,skip) using the skip variable as an absolute pointer, but it 
also appeared to make things slower, presumably because it counts chars 
from the start of the dictionary.  I guess I could divide dict2 (or 
dict1) into many small segments, moving to each successive segment as 
the previous one was exhausted, but I was hoping for something more 
elegant.
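
   If it comes to that, the sort of segmenting I have in mind is roughly 
this (the chunk size is pulled out of a hat):

  -- carve dict2 into fixed-size chunks, once, so that subsequent
  -- ``line N of ...'' lookups never scan more than one chunk
  put 10000 into tChunkSize
  put 0 into tChunkNum
  repeat
    put tChunkNum * tChunkSize + 1 into tFirst
    put tFirst + tChunkSize - 1 into tLast
    put line tFirst to tLast of tDict2 into tChunk  -- still counted from line 1, but paid only once per chunk
    if tChunk is empty then exit repeat
    put tChunk into tChunks[tChunkNum]
    add 1 to tChunkNum
  end repeat
  -- a global line pointer P then lives in chunk (P - 1) div tChunkSize,
  -- at line ((P - 1) mod tChunkSize) + 1 within that chunk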

   What I need are *absolute* pointers (preferably a memory address, or 
a pointer or a handle to such), rather than the relative (to the 
beginning of the list) pointers given by the line (and possibly ``for 
each'') construct.  Arrays presumably would work (but doesn't metacard 
then have to search the indices to resolve the array reference?), and 
reading the million-line file into the array just sets the problem 
back one step.
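
   Just to make that concrete: if metacard resolves array keys by 
hashing (rather than by searching through them), then the array version 
I have in mind would be just two linear passes, roughly like this 
(names illustrative):

  -- pass 1: sum frequencies over all use listings of each word in dict2
  repeat for each line tEntry in tDict2
    put word 1 of tEntry into tItem
    put word 2 of tEntry into tCount
    if tCount is not a number then next repeat  -- skip non-word/odd entries
    if tFreq[tItem] is empty then put 0 into tFreq[tItem]
    add tCount to tFreq[tItem]
  end repeat
  -- pass 2: look up each spelling-dictionary word; missing words get 0
  repeat for each line tWord in tDict1
    if tFreq[tWord] is empty then put 0 into tFreq[tWord]
    put tWord & tab & tFreq[tWord] & return after tResults
  end repeat

That would also give me the zero frequencies for free; my worry is just 
whether the per-key lookup really is constant-time.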

   Any suggestions would be appreciated.  As I receive the list in 
digest form, if you have a scathingly brilliant idea, please send a 
copy of it directly to my email address.  TIA
-- 
John R. Vokey, Ph.D.                       |\      _,,,---,,_
Professor                                  /,`.-'`'    -.  ;-;;,_
Department of Psychology and Neuroscience  |,4-  ) )-,_. ,\ (  `'-'
University of Lethbridge                  '---''(_/--'  `-'\_)



