wikipedia-to-mongodb
v2.4.0
get a wikipedia dump parsed into mongodb
A whole Wikipedia dump, in mongodb.
put your hefty wikipedia dump into mongo, with fully-parsed wikiscript - without thinking, without loading it into memory, grepping, unzipping, or other crazy command-line nonsense.
It's a javascript one-liner that puts a highly-queryable wikipedia on your laptop in a nice afternoon.
It uses wtf_wikipedia to parse wikiscript into almost-nice json.
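For a rough idea of the shape, each article ends up as a document something like this (an illustrative sketch; the exact fields depend on the article and on the wtf_wikipedia version):
// an illustrative sketch of one stored article (exact fields vary)
{
  _id: 'Toronto',        // page titles are unique, so they double as the mongo _id
  title: 'Toronto',
  categories: [
    'Former colonial capitals in Canada',
    'Populated places established in 1793',
    // ...
  ],
  // ...plus whatever wtf_wikipedia pulls out of the article's wikiscript,
  // and a 'type' field for redirects, disambiguation pages, etc.
}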
npm install -g wikipedia-to-mongodb
⚡ From the Command-Line:
wp2mongo /path/to/my-wikipedia-article-dump.xml.bz2
😎 From a nodejs script
var wp2mongo = require('wikipedia-to-mongodb')
wp2mongo({file:'./enwiki-latest-pages-articles.xml.bz2', db: 'enwiki'}, callback)
then check out the articles in mongo:
$ mongo #enter the mongo shell
use enwiki #grab the database
db.wikipedia.find({title:"Toronto"})[0].categories
#[ "Former colonial capitals in Canada",
# "Populated places established in 1793" ...]
db.wikipedia.count({type:"redirect"})
# 124,999...
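The same queries work from node with the official mongodb driver. A quick sketch, assuming the 'enwiki' db and the 'wikipedia' collection from the examples above, with mongo on its default port:
// sketch: reading the parsed articles with the official mongodb driver
const { MongoClient } = require('mongodb')

async function peek() {
  const client = await MongoClient.connect('mongodb://localhost:27017')
  const wiki = client.db('enwiki').collection('wikipedia')

  const toronto = await wiki.findOne({ title: 'Toronto' })
  console.log(toronto.categories)

  const redirects = await wiki.countDocuments({ type: 'redirect' })
  console.log(redirects + ' redirects')

  await client.close()
}
peek()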
Steps:
1) 💪 you can do this.
you can do this. it's only a few GB. you can do this.
2) get ready
Install nodejs, mongodb, and optionally redis
# start mongo
mongod --config /mypath/to/mongod.conf
# install wp2mongo
npm install -g wikipedia-to-mongodb
that gives you the global command wp2mongo
3) download a wikipedia
The Afrikaans wikipedia (around 47,000 artikels) only takes a few minutes to download, and 10 mins to load into mongo on a macbook:
# download an xml dump (38MB, a couple minutes)
wget https://dumps.wikimedia.org/afwiki/latest/afwiki-latest-pages-articles.xml.bz2
the english/german ones are bigger. Use whichever xml dump you'd like. The download page is weird, but you'll want the most common dump format, without historical diffs or images, which is ${LANG}wiki-latest-pages-articles.xml.bz2
4) get it going
#load it into mongo (10-15 minutes)
wp2mongo ./afwiki-latest-pages-articles.xml.bz2
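Or the same thing from a node script, if you'd rather skip the CLI (the 'afwiki' db name here is just an example; pass whatever you like):
// same import, from node instead of the CLI ('afwiki' is an arbitrary db name)
const wp2mongo = require('wikipedia-to-mongodb')
wp2mongo({ file: './afwiki-latest-pages-articles.xml.bz2', db: 'afwiki' }, () => {
  console.log('done!')
})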
5) take a bath
just put some epsom salts in there, it feels great. You deserve a break once in a while. The en-wiki dump should take a few hours. Should be done before dinner.
6) check-out your data
to view your data in the mongo console,
$ mongo
use af_wikipedia
//show a couple of arbitrary pages
db.wikipedia.find().skip(200).limit(2)
//count the redirects (~5,000 in afrikaans)
db.wikipedia.count({type:"redirect"})
//find a specific page
db.wikipedia.findOne({title:"Toronto"}).categories
Same for the English wikipedia:
the english wikipedia works with the same process, but the download will take an afternoon, and the loading/parsing a couple of hours. The en wikipedia dump is about 13 GB compressed (for enwiki-20170901-pages-articles.xml.bz2), and becomes a pretty legit mongo collection once uncompressed, something like 51 GB. But mongo can do it... You can do it!
Options
human-readable plaintext --plaintext
wp2mongo({file:'./myfile.xml.bz2', db: 'enwiki', plaintext:true}, console.log)
/*
[{
_id:'Toronto',
title:'Toronto',
plaintext:'Toronto is the most populous city in Canada and the provincial capital...'
}]
*/
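A nice side-effect of --plaintext: once the text is in mongo, you can put an ordinary text index on it and do rough full-text search. These are standard mongo shell commands, nothing specific to this library:
// in the mongo shell: index the plaintext field, then search it
db.wikipedia.createIndex({ plaintext: 'text' })
db.wikipedia.find({ $text: { $search: 'provincial capital' } }).limit(5)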
go faster with Redis --worker
there's a much faster way (up to 10x) to import all the pages into mongodb, but it's a little more complex: it requires redis installed on your computer, and a worker running in a separate process.
It also gives you a cool dashboard, to watch the progress.
# install redis
sudo apt-get install redis-server # (or `brew install redis` on a mac)
# clone the repo
git clone [email protected]:spencermountain/wikipedia-to-mongodb.git && cd wikipedia-to-mongodb
#load pages into job queue
bin/wp2mongo.js ./afwiki-latest-pages-articles.xml.bz2 --worker
# start processing jobs (parsing articles and saving to mongodb) on all CPU's
node src/worker.js
# you can preview processing jobs in kue dashboard (localhost:3000)
node node_modules/kue/bin/kue-dashboard -p 3000
skip unnecessary pages --skip_disambig, --skip_redirects
this can make it go faster too, by skipping entries in the dump that aren't full-on articles.
let obj = {
file: './path/enwiki-latest-pages-articles.xml.bz2',
db: 'enwiki',
skip_redirects: true,
skip_disambig: true,
skip_first: 1000, // ignore the first 1k pages
verbose: true, // print each article title
}
wp2mongo(obj, () => console.log('done!') )
how it works:
this library uses:
unbzip2-stream to stream-uncompress the gnarly bz2 file
xml-stream to stream-parse its xml format
wtf_wikipedia to brute-parse the article wikiscript contents into JSON.
redis to (optionally) put wikiscript parsing on separate threads :metal:
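Roughly, the plumbing fits together like this. This is a simplified sketch, not the library's actual code, and the wtf_wikipedia call it hints at differs between versions:
// simplified sketch of the streaming pipeline (not the library's actual code)
const fs = require('fs')
const bz2 = require('unbzip2-stream')
const XmlStream = require('xml-stream')

// stream-decompress the dump, then stream-parse the xml, one <page> at a time,
// so the whole thing never sits in memory
const stream = fs.createReadStream('./afwiki-latest-pages-articles.xml.bz2').pipe(bz2())
const xml = new XmlStream(stream)

xml.on('endElement: page', (page) => {
  // each <page> element holds the title and the raw wikiscript (in <revision><text>);
  // the real thing hands the wikiscript to wtf_wikipedia and upserts the parsed
  // result into mongo, keyed by the title
  console.log('saw page:', page.title)
})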
Addendum:
_ids
since wikimedia gives every page a globally-unique title, we also use the titles for the mongo _id fields.
The benefit is that if the import crashes half-way through, or if you want to run it again, running this script repeatedly won't multiply your data: we do an 'upsert' on each record.
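With the mongodb driver, that upsert looks something like this (a sketch of the idea, not the library's exact code):
// sketch: a title-keyed upsert, so re-running the import never duplicates a page
async function savePage(collection, article) {
  await collection.updateOne(
    { _id: article.title },   // the page title doubles as the mongo _id
    { $set: article },        // overwrite the stored fields with the freshly-parsed ones
    { upsert: true }          // insert the page if it isn't there yet
  )
}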
encoding special characters
mongo has some opinions about special characters in some of its data. It's weird, but we're using this standard(ish) form of encoding them:
\ --> \\
$ --> \u0024
. --> \u002e
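In javascript terms the substitution is something like this; a small helper just to illustrate the mapping, not necessarily the library's exact function:
// illustrative helper: escape the characters mongo is picky about.
// backslashes go first, so the escapes this produces don't get re-encoded
function encodeSpecial(str) {
  return str
    .replace(/\\/g, '\\\\')
    .replace(/\$/g, '\\u0024')
    .replace(/\./g, '\\u002e')
}

encodeSpecial("St. John's") // -> "St\u002e John's"  (a literal backslash-u sequence)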
Non-wikipedias
This library should also work on other wikis with standard xml dumps from MediaWiki. I haven't tested them, but wtf_wikipedia supports all sorts of non-standard wiktionary/wikivoyage templates, and if you can get a bz2-compressed xml dump of your wiki, this should work fine. Open an issue if you find something weird.
PRs welcome!
MIT