Words Indexing strategies

Alejandro Tejada capellan2000 at gmail.com
Mon Feb 22 13:12:56 EST 2010


Some time ago, I posted a message asking for
volunteers to create a Wikipedia CD/DVD.

Since then, I have been working on this project
and have made some progress, which will be
published as soon as it works as expected.

These are the steps that I am following to
process the XML databases:

The Wikipedia XML databases are huge, and you get
the best results when they are downloaded
directly from:

http://download.wikipedia.org/

The download address for the English Wikipedia
XML database from February 3, 2010 is:
http://download.wikimedia.org/enwiki/20100130/enwiki-20100130-pages-articles.xml.bz2

The download address for the Spanish Wikipedia
XML database from February 21, 2010 is:
http://download.wikimedia.org/eswiki/20100221/eswiki-20100221-pages-meta-current.xml.bz2
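
If you want to script the download from LiveCode, something
like this sketch could work, using the Internet library's
libURLDownloadToFile (the local folder and handler name below
are just examples, not fixed choices):

-- sketch: download the English dump to disk without loading
-- it into memory (requires the Internet library)
on downloadDump
   libURLDownloadToFile \
         "http://download.wikimedia.org/enwiki/20100130/enwiki-20100130-pages-articles.xml.bz2", \
         "C:/wikipedia/enwiki-20100130-pages-articles.xml.bz2", \
         "dumpDownloaded"
end downloadDump

on dumpDownloaded pURL
   answer "Finished downloading" && pURL
end dumpDownloaded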

After downloading the compressed XML database, you should
put it inside a folder (not in the disk root) and split
the file into small bz2 files using bzip2recover:
http://www.bzip.org/downloads.html
http://www.bzip.org/1.0.5/bzip2recover-105-x86-win32.exe
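
From LiveCode, this step could be scripted with shell();
a sketch, assuming bzip2recover is on the path and the dump
sits in the folder named below (the folder is just an example):

-- sketch: split the dump into one small bz2 file per
-- compressed block
set the defaultFolder to "C:/wikipedia" -- example folder, not the disk root
get shell("bzip2recover enwiki-20100130-pages-articles.xml.bz2")
-- bzip2recover writes rec00001...bz2, rec00002...bz2, ...
-- into this same folder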

It is easier to deal with many small compressed files than with
one humongous text file of more than 25 GB (English XML database)
or 5.3 GB (Spanish XML database).

After using bzip2recover to split the English XML database,
I get more than 28,000 small (~250 KB) bz2 files; the Spanish
XML database yields about 6,800 small bz2 files.
Each of these files holds (more or less) a 1 MB segment
of the database.
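
To read the segments back, you can shell out to bzip2 and
process the decompressed text. Here is a sketch that counts
how many <page> tags appear across all segments; it assumes a
bzip2 command-line binary is on the path, and it will miss a
tag that happens to be split across two segments:

-- sketch: decompress each recovered segment and count
-- the <page> tags it contains
set the defaultFolder to "C:/wikipedia" -- example folder
put the files into tFiles
filter tFiles with "rec*.bz2"
sort lines of tFiles ascending text -- keep segments in dump order
put 0 into tPageCount
repeat for each line tFile in tFiles
   put shell("bzip2 -dc" && tFile) into tSegment -- ~1 MB of XML
   filter tSegment with "*<page>*" -- keep lines opening an article
   add the number of lines of tSegment to tPageCount
end repeat
answer tPageCount && "pages found"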

Notice that I chose a different dump file for the Spanish
XML database (pages-meta-current instead of pages-articles).
That is because Wikimedia has been unable to solve a problem
with their backup of the Spanish XML database:

https://bugzilla.wikimedia.org/show_bug.cgi?id=18694

Alejandro


