index library

Mon Apr 19 01:23:53 EDT 2004

If you're looking for inspiring documentation, I would recommend 
checking out Apple's old AIAT SDK. It's the basis for MacOS 
Find-By-Content, but the really cool thing is that the documentation 
spells out very nicely how vector and inverted indices work.

Or... try googling "inverted vector index".

In the past I've hooked several engines up to Rev (including AIAT), 
but they all required externals and/or separate apps running. There's a 
Java-based spinoff of AIAT called "Lucene" from the Apache project, 
which is interesting, but you'd most likely have to write a Java app 
and talk back and forth with it.

If you could implement the basic inverted vector index algorithms and 
figure out an efficient way to store the indices on disk, it could 
become a pretty decent engine in Transcript, even if it might not be 
suitable for indexing your hard drive or spidering the web...
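
To make the idea a bit more concrete, here's a rough sketch of a basic 
inverted index with a simple on-disk format. It's in Python rather than 
Transcript, purely for illustration. The function names (tokenize, 
build_index, search, save_index, load_index) and the JSON file format 
are my own assumptions, not anything taken from AIAT or Lucene, and a 
real engine would want a more compact and incremental storage scheme.

```python
import json
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to {doc_id: occurrence_count} (an inverted index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)


def search(index, query):
    """Return the sorted ids of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set(index.get(terms[0], {}))
    for term in terms[1:]:
        hits &= set(index.get(term, {}))
    return sorted(hits)


def save_index(index, path):
    """Write the index to disk as JSON (hypothetical format, for illustration)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(index, f)


def load_index(path):
    """Read back an index written by save_index."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Document ids are strings so they survive the JSON round trip unchanged.
docs = {
    "a": "The quick brown fox",
    "b": "The lazy brown dog",
    "c": "A quick red fox jumps",
}
index = build_index(docs)
print(search(index, "quick fox"))
```

The same shape should map onto Transcript's associative arrays fairly 
directly (one array keyed by term, each element holding a list of 
document ids), which sidesteps the need for pointers.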

For more fun reading, there's stemming (which is pretty crude and 
easy), thesauri (which you have to be very careful with or you just 
increase noise), stopword removal (i.e. cutting out the "and" and "the" 
words), and relevancy ranking. All of this is covered in the 
aforementioned AIAT SDK.
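
As a rough illustration of how those pieces fit together, here's a 
small Python sketch that does stopword removal, a deliberately crude 
suffix-stripping stemmer, and a simple TF-IDF style relevancy score. 
The stopword list, the suffix rules, and the scoring formula are all 
assumptions of mine for the example. They are not what the AIAT SDK 
does, and a real stemmer (Porter's, for instance) has many more rules.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "is"}


def crude_stem(word):
    """Strip one common English suffix, keeping at least 3 letters."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def terms(text):
    """Tokenize, drop stopwords, and stem what is left."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOPWORDS]


def rank(docs, query):
    """Return ids of matching docs, best first, by summed TF-IDF score."""
    doc_terms = {doc_id: Counter(terms(text)) for doc_id, text in docs.items()}
    n = len(docs)
    query_terms = set(terms(query))
    scores = {}
    for doc_id, counts in doc_terms.items():
        score = 0.0
        for t in query_terms:
            # Document frequency: how many docs contain the term at all.
            df = sum(1 for c in doc_terms.values() if t in c)
            if df and counts[t]:
                score += counts[t] * math.log(1 + n / df)
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties broken by doc id for stable output.
    return sorted(scores, key=lambda d: (-scores[d], d))


docs = {
    "a": "Indexing the web and indexing files",
    "b": "The index of a book",
    "c": "Cooking with garlic",
}
print(rank(docs, "indexes"))
```

Note that the query "indexes" matches both "indexing" and "index" only 
because all three stem to the same form, which is the whole point of 
stemming, crude as it is.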

Pretty interesting stuff. Keep me posted if you take a crack at it. I 
can't really co-conspire at the moment, but I'd be happy to chime in 
where I can be helpful.

HTH,
Brian

> hypertexting of words in a large text corpus. I can find several such 
> libraries on the web, but in languages that don't port well to 
> Transcript (i.e., needing pointers and multidim arrays. sigh). I would 
> gladly work with anybody wanting to do one.