indexing slow-down (i.e., speeding up programs)

Geoff Canyon gcanyon at inspiredlogic.com
Sat Apr 12 20:11:00 EDT 2003


If you have enough memory, convert the second list (the frequency
dictionary) to an array keyed by word. Then your process will run in
linear time.
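Something along these lines should do it. This is just a sketch: I'm
assuming the frequency list is word & tab & frequency, one entry per
line, and the variable names (tFreq, tResult, and so on) are only
illustrative; adjust the itemDelimiter and the item numbers to your
actual layout.

  -- build the lookup array from the frequency dictionary (one pass)
  set the itemDelimiter to tab
  repeat for each line tEntry in dict2
    put item 1 of tEntry into tWord
    if tFreq[tWord] is empty then
      put item 2 of tEntry into tFreq[tWord]
    else
      add item 2 of tEntry to tFreq[tWord] -- sums the noun/verb uses
    end if
  end repeat

  -- now each lookup is a single array access, not a line count
  repeat for each line tWord in dict1
    if tFreq[tWord] is empty then put 0 into tFreq[tWord]
    put tWord & tab & tFreq[tWord] & return after tResult
  end repeat

Both passes use "repeat for each", so the total work grows with the
size of the files rather than with the square of the number of lines.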

If you don't have enough memory, break the second list into chunks by
first letter, and then probably still do the array conversion on each
chunk in turn. That, or just use your regular routine within each
chunk.
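For the chunked variant, the splitting step might look something like
this (again just a sketch; tChunk is an illustrative name). Each
letter's chunk then gets the same array treatment as above, and you
throw that array away before moving on to the next letter, so only
one letter's worth of keys is live at any time:

  -- split the frequency dictionary into one chunk per leading character
  repeat for each line tEntry in dict2
    put tEntry & return after tChunk[char 1 of tEntry]
  end repeat
  put empty into dict2 -- free the full list once it has been chunked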

On Saturday, April 12, 2003, at 01:14 PM, John Vokey wrote:

>   Here's my task:  I've got two LARGE text files, one a spelling 
> dictionary (from Excalibur) and the other a dictionary of lexical 
> frequencies (i.e., item and frequency); there are over 113,000 unique 
> words (as lines) in the first, and about a million different entries 
> (as separate lines) in the second, many not regular words (e.g., 
> dates, numbers, etc.); furthermore, in the second, the same word is 
> listed multiple times depending on use (e.g., ``rose'' as a noun and 
> ``rose'' as a verb).  I want to extract the lexical frequency 
> (ignoring use type) of each of the words in the first from the second, 
> including assigning a frequency of zero for those from the first not 
> found in the second.
>
>   Fortunately, both are already alphabetised, so as I move 
> sequentially through the first, I can simply start searching from 
> where I last left off in the second, summing over multiple use 
> listings of the same word.  So far so good.   I use the ``repeat for 
> each ... in ... '' construct for the first dictionary, and a variable 
> pointer for the second that I advance as I find each word from the 
> first (I call a simple function that skips over non-words and advances 
> the pointer, returning the line of the next word in the lexical 
> dictionary).  Both text files are read into memory at the start of the 
> routine (which takes less than a second using the ``url file://...'' 
> command; way to go, metacard!).
>
>   Here's my problem: initially, the routine takes much less than 1 
> second per hundred words (which, when multiplied by the number of 
> words remaining to index, results in an estimate of some small 
> fraction of an hour to do the whole task).  However, it rapidly (as an 
> exponential function) slows down, so that by the time it reaches the 
> middle of the alphabet (M), it takes many minutes per 100 words, and 
> an ever-increasing time estimate for the remaining items (now over 60 
> hours!).  Clearly, either the ``repeat for each'' command for the 
> first dictionary  or the ``get line pointer...'' for the second (or 
> both) get(s) slower and slower as I progress through the dictionaries, 
> presumably because to do one or the other (or both), metacard counts 
> carriage returns from the beginning of the dictionary.
>
>   I've tried: deleting the items from the front of the lexical 
> dictionary as I've searched them, so that each new search starts at 
> line 1 of the edited list [i.e., put line pointer to (the number of 
> lines of dict2) of dict2 into dict2], but that slows it down even 
> more (presumably because metacard needs to count crs to find the 
> number of the last line each time).  I've tried various machinations 
> of offset(,,skip) using the skip variable as an absolute pointer, but 
> it also appeared to make things slower, presumably because it counts 
> chars from the start of the dictionary.  I guess I could divide dict2 
> (or dict1) into many small segments, moving to each successive segment 
> as the previous one was exhausted, but I was hoping for something more 
> elegant.
>
>   What I need are *absolute* pointers (preferably a memory address, or 
> a pointer or a handle to such), rather than the relative (to the 
> beginning of the list) pointers given by the line (and possibly ``for 
> each'') construct.  Arrays presumably would work (but doesn't metacard 
> then have to search the indices to resolve the array reference?), and 
> reading the million-line file into the array just sets the problem 
> back one step.
>
>   Any suggestions would be appreciated.  As I receive the list in 
> digest form, if you have a scathingly brilliant idea, please send a 
> copy of it directly to my email address.  TIA
> -- 
> John R. Vokey, Ph.D.                       |\      _,,,---,,_
> Professor                                  /,`.-'`'    -.  ;-;;,_
> Department of Psychology and Neuroscience  |,4-  ) )-,_. ,\ (  `'-'
> University of Lethbridge                  '---''(_/--'  `-'\_)
>
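To make the slowdown concrete: every access of the form below makes
the engine count returns from character 1 of dict2 to find the line,
so each step costs work proportional to how far into the list the
pointer already is, and the total time grows with the square of the
number of lines processed:

  -- this is the pattern that gets slower and slower
  put line tPointer of dict2 into tEntry -- scans from the top each time
  add 1 to tPointer

"repeat for each" keeps its own place in the text as it goes, and
resolving tFreq[tWord] is a hash lookup rather than a scan of the
keys, which is why the sketches above stay fast all the way to Z and
why the array does not just set the problem back a step.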

I hope this helps. Feel free to contact me if you have any further 
questions.

regards,

Geoff Canyon
Revolution Support
--
Geoff Canyon <geoff at runrev.com> <http://www.runrev.com/>
Runtime Revolution Limited: Software at the Speed of Thought
Tel: +44 (0) 870 747 1165.  Fax: +44 (0)1639 830 707.



