Words Indexing strategies

Alejandro Tejada capellan2000 at gmail.com
Mon Feb 8 18:00:04 CST 2010


Hi all,

Some time ago, I posted a message asking for
volunteers to create a Wikipedia CD/DVD.

Since then, I have been working on this project
and have made some progress, which will be
published as soon as it works as expected.

Now I need advice about possible strategies
to create a fast and responsive word index
for all Wikipedia articles, with capabilities similar
to those demonstrated by the Google search engine, such as
suggested search terms and similar words.

Note that I am not using any database engine
in this project to index the article titles.

Because of memory constraints and for performance reasons,
these are the steps I followed:

1) The Wikipedia XML database is divided into multiple small
UTF8 text files (each approx. 1 MB) compressed in .gz format
(reduced to 250-350 KB). I have files numbered from 00001
to 06455 for the Spanish Wikipedia. The English Wikipedia runs from
00001 to 28750.

NOTE: Using such small database files allows users to read
any linked article quickly, because the program finds, decompresses
and processes a small file. This is fast, even on old computers.

2) Each database part is indexed for article titles and words.

3) These multiple index files are merged into one big UTF8 index
text file arranged in alphabetical order.

4) The big UTF8 index text file is split into 28 small UTF8 index
text files. That is, a different file for each initial letter:

1 file for decimal ASCII 33 to 64: ! to @
26 files for decimal ASCII 65 to 90: A to Z
1 file for decimal ASCII 91 and above

The largest UTF8 index text file is the one for the letter C.

5) When users click an article link, the program checks
the first letter of the clicked link and searches for the article
name in the corresponding index.
That is: a linked article that starts with G is
searched for only in the UTF8 Article Index "G".

This works fine 99.9% of the time; the remaining failures
come from errors in the names of linked articles.
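
To make steps 3 to 5 concrete, here is a minimal sketch in Python. It is
only an illustration, not the project's actual code. The index entry layout
(article title plus part number), the plain code-point sort order, the
bucket names "symbols" and "other", and all function names are assumptions
made for this example.

```python
import bisect
import gzip
import heapq


def bucket_for(title):
    """Pick one of the 28 index buckets from the first character of a title."""
    if not title:
        return "other"
    code = ord(title[0])
    if 33 <= code <= 64:
        return "symbols"   # decimal ASCII 33 to 64: ! to @
    if 65 <= code <= 90:
        return title[0]    # decimal ASCII 65 to 90: A to Z
    return "other"         # 91 and above (and, in this sketch, anything else)


def merge_indexes(part_indexes):
    """Step 3: merge per-part indexes into one sorted list.

    Each part index is assumed to be a list of (title, part_number)
    tuples that is already sorted.
    """
    return list(heapq.merge(*part_indexes))


def split_by_letter(merged):
    """Step 4: split the merged index into per-bucket lists (order is kept)."""
    buckets = {}
    for entry in merged:
        buckets.setdefault(bucket_for(entry[0]), []).append(entry)
    return buckets


def find_part(buckets, title):
    """Step 5: binary-search only the bucket that matches the first letter."""
    entries = buckets.get(bucket_for(title), [])
    pos = bisect.bisect_left(entries, (title,))
    if pos < len(entries) and entries[pos][0] == title:
        return entries[pos][1]
    return None  # e.g. a link whose name does not match any title


def read_part(path):
    """Step 1: decompress one small .gz part and return its UTF8 text."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return handle.read()


if __name__ == "__main__":
    part_1 = [("Cuba", 1), ("Guatemala", 1)]
    part_2 = [("1492", 2), ("Galicia", 2)]
    buckets = split_by_letter(merge_indexes([part_1, part_2]))
    print(find_part(buckets, "Galicia"))
```

In this sketch a title beginning with a lowercase or accented letter lands
in the "other" bucket, which follows the ASCII ranges listed in step 4.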

Now I am looking for advice on creating an index structure for searching
for specific words inside the articles' text. I have been unable to implement
a fast search algorithm that handles multiple words, similar to Wikipedia's
own search engine. Any idea or advice is welcome.

Thanks in advance!

Alejandro



More information about the use-livecode mailing list