wikipedia-to-mongodb
v2.4.0
get a wikipedia dump parsed into mongodb
A whole Wikipedia dump, in mongodb.
put your hefty wikipedia dump into mongo, with fully-parsed wikiscript - without thinking, without loading it into memory, grepping, unzipping, or other crazy command-line nonsense.
It's a javascript one-liner that puts a highly-queryable wikipedia on your laptop in a nice afternoon.
It uses wtf_wikipedia to parse wikiscript into almost-nice json.
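For a rough idea of the shape, each article ends up as a document something like this (an illustrative sketch; the exact fields depend on the article and on the wtf_wikipedia version):
// an illustrative sketch of one stored article (exact fields vary)
{
  _id: 'Toronto',        // page titles are unique, so they double as the mongo _id
  title: 'Toronto',
  categories: [
    'Former colonial capitals in Canada',
    'Populated places established in 1793',
    // ...
  ],
  // ...plus whatever wtf_wikipedia pulls out of the article's wikiscript,
  // and a 'type' field for redirects, disambiguation pages, etc.
}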
npm install -g wikipedia-to-mongodb
⚡ From the Command-Line:
wp2mongo /path/to/my-wikipedia-article-dump.xml.bz2
😎 From a nodejs script
var wp2mongo = require('wikipedia-to-mongodb')
wp2mongo({file:'./enwiki-latest-pages-articles.xml.bz2', db: 'enwiki'}, callback)
then check out the articles in mongo:
$ mongo #enter the mongo shell
use enwiki #grab the database
db.wikipedia.find({title:"Toronto"})[0].categories
#[ "Former colonial capitals in Canada",
# "Populated places established in 1793" ...]
db.wikipedia.count({type:"redirect"})
# 124,999...
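The same queries work from node with the official mongodb driver. A quick sketch, assuming the 'enwiki' db and the 'wikipedia' collection from the examples above, with mongo on its default port:
// sketch: reading the parsed articles with the official mongodb driver
const { MongoClient } = require('mongodb')

async function peek() {
  const client = await MongoClient.connect('mongodb://localhost:27017')
  const wiki = client.db('enwiki').collection('wikipedia')

  const toronto = await wiki.findOne({ title: 'Toronto' })
  console.log(toronto.categories)

  const redirects = await wiki.countDocuments({ type: 'redirect' })
  console.log(redirects + ' redirects')

  await client.close()
}
peek()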
Steps:
1) 💪 you can do this.
you can do this. it's only a few GB. you can do this.
2) get ready
Install nodejs, mongodb, and optionally redis
# start mongo
mongod --config /mypath/to/mongod.conf
# install wp2mongo
npm install -g wikipedia-to-mongodb
that gives you the global command wp2mongo
3) download a wikipedia
The Afrikaans wikipedia (around 47,000 artikels) only takes a few minutes to download, and 10 mins to load into mongo on a macbook:
# download an xml dump (38MB, a couple minutes)
wget https://dumps.wikimedia.org/afwiki/latest/afwiki-latest-pages-articles.xml.bz2
the english/german ones are bigger. Use whichever xml dump you'd like. The download page is weird, but you'll want the most common dump format, without historical diffs or images, which is ${LANG}wiki-latest-pages-articles.xml.bz2
4) get it going
#load it into mongo (10-15 minutes)
wp2mongo ./afwiki-latest-pages-articles.xml.bz2
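Or the same thing from a node script, if you'd rather skip the CLI (the 'afwiki' db name here is just an example; pass whatever you like):
// same import, from node instead of the CLI ('afwiki' is an arbitrary db name)
const wp2mongo = require('wikipedia-to-mongodb')
wp2mongo({ file: './afwiki-latest-pages-articles.xml.bz2', db: 'afwiki' }, () => {
  console.log('done!')
})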
5) take a bath
just put some epsom salts in there, it feels great. You deserve a break once in a while. The en-wiki dump should take a few hours. Should be done before dinner.
6) check-out your data
to view your data in the mongo console,
$ mongo
use af_wikipedia
//show a couple of arbitrary pages
db.wikipedia.find().skip(200).limit(2)
//count the redirects (~5,000 in afrikaans)
db.wikipedia.count({type:"redirect"})
//find a specific page
db.wikipedia.findOne({title:"Toronto"}).categories
Same for the English wikipedia:
the english wikipedia works with the same process, but the download will take an afternoon, and the loading/parsing a couple of hours. The en wikipedia dump is about 13 GB compressed (for enwiki-20170901-pages-articles.xml.bz2), and becomes a pretty legit mongo collection once uncompressed, something like 51 GB. But mongo can do it... You can do it!
Options
human-readable plaintext --plaintext
wp2mongo({file:'./myfile.xml.bz2', db: 'enwiki', plaintext:true}, console.log)
/*
[{
_id:'Toronto',
title:'Toronto',
plaintext:'Toronto is the most populous city in Canada and the provincial capital...'
}]
*/
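A nice side-effect of --plaintext: once the text is in mongo, you can put an ordinary text index on it and do rough full-text search. These are standard mongo shell commands, nothing specific to this library:
// in the mongo shell: index the plaintext field, then search it
db.wikipedia.createIndex({ plaintext: 'text' })
db.wikipedia.find({ $text: { $search: 'provincial capital' } }).limit(5)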
go faster with Redis --worker
there's a much faster way (up to 10x) to import all the pages into mongodb, but it's a little more complex: it requires redis installed on your computer, and a worker running in a separate process.
It also gives you a cool dashboard, to watch the progress.
# install redis
sudo apt-get install redis-server # (or `brew install redis` on a mac)
# clone the repo
git clone [email protected]:spencermountain/wikipedia-to-mongodb.git && cd wikipedia-to-mongodb
#load pages into job queue
bin/wp2mongo.js ./afwiki-latest-pages-articles.xml.bz2 --worker
# start processing jobs (parsing articles and saving to mongodb) on all CPU's
node src/worker.js
# you can preview processing jobs in kue dashboard (localhost:3000)
node node_modules/kue/bin/kue-dashboard -p 3000
skip unnecessary pages --skip_disambig, --skip_redirects
this can make it go faster too, by skipping entries in the dump that aren't full-on articles.
let obj = {
file: './path/enwiki-latest-pages-articles.xml.bz2',
db: 'enwiki',
skip_redirects: true,
skip_disambig: true,
skip_first: 1000, // ignore the first 1k pages
verbose: true, // print each article title
}
wp2mongo(obj, () => console.log('done!') )
how it works:
this library uses:
unbzip2-stream to stream-uncompress the gnarly bz2 file
xml-stream to stream-parse its xml format
wtf_wikipedia to brute-parse the article wikiscript contents into JSON.
redis to (optionally) put wikiscript parsing on separate threads :metal:
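Roughly, the plumbing fits together like this. This is a simplified sketch, not the library's actual code, and the wtf_wikipedia call it hints at differs between versions:
// simplified sketch of the streaming pipeline (not the library's actual code)
const fs = require('fs')
const bz2 = require('unbzip2-stream')
const XmlStream = require('xml-stream')

// stream-decompress the dump, then stream-parse the xml, one <page> at a time,
// so the whole thing never sits in memory
const stream = fs.createReadStream('./afwiki-latest-pages-articles.xml.bz2').pipe(bz2())
const xml = new XmlStream(stream)

xml.on('endElement: page', (page) => {
  // each <page> element holds the title and the raw wikiscript (in <revision><text>);
  // the real thing hands the wikiscript to wtf_wikipedia and upserts the parsed
  // result into mongo, keyed by the title
  console.log('saw page:', page.title)
})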
Addendum:
_ids
since wikimedia gives every page a globally-unique title, we also use the titles for the mongo _id fields.
The benefit is that if the import crashes half-way through, or if you want to run it again, running this script repeatedly won't multiply your data: we do an 'upsert' on each record.
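With the mongodb driver, that upsert looks something like this (a sketch of the idea, not the library's exact code):
// sketch: a title-keyed upsert, so re-running the import never duplicates a page
async function savePage(collection, article) {
  await collection.updateOne(
    { _id: article.title },   // the page title doubles as the mongo _id
    { $set: article },        // overwrite the stored fields with the freshly-parsed ones
    { upsert: true }          // insert the page if it isn't there yet
  )
}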
encoding special characters
mongo has some opinions about special characters in some of its data. It's weird, but we're using this standard(ish) form of encoding them:
\ --> \\
$ --> \u0024
. --> \u002e
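In javascript terms the substitution is something like this; a small helper just to illustrate the mapping, not necessarily the library's exact function:
// illustrative helper: escape the characters mongo is picky about.
// backslashes go first, so the escapes this produces don't get re-encoded
function encodeSpecial(str) {
  return str
    .replace(/\\/g, '\\\\')
    .replace(/\$/g, '\\u0024')
    .replace(/\./g, '\\u002e')
}

encodeSpecial("St. John's") // -> "St\u002e John's"  (a literal backslash-u sequence)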
Non-wikipedias
This library should also work on other wikis with standard xml dumps from MediaWiki. I haven't tested them, but wtf_wikipedia supports all sorts of non-standard wiktionary/wikivoyage templates, and if you can get a bz2-compressed xml dump of your wiki, this should work fine. Open an issue if you find something weird.
PRs welcome!
MIT