wikipedia-elasticsearch-import
Import Wikipedia dumps into Elasticsearch server using streams and bulk indexing for speed.
What does this module do?
Wikipedia regularly publishes dumps of its entire database in every language, which you can download and use for free. This module parses the giant Wikipedia XML dump file as a stream and bulk-imports its contents straight into your Elasticsearch server or cluster.
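The core idea is to read the XML as a stream, collect pages into batches, and send each batch to Elasticsearch in a single bulk request. Below is a minimal sketch of that approach, assuming the `sax` parser and the official `@elastic/elasticsearch` client with its v7-style `bulk({ body })` API; the module's actual dependencies, field handling, and index name may differ.
```js
const fs = require('fs');
const sax = require('sax');
const { Client } = require('@elastic/elasticsearch');

const client = new Client({ node: 'http://localhost:9200' });
const xml = sax.createStream(true); // strict XML parsing

const BULK_SIZE = 100; // documents per bulk request, matching the config.js default
let page = null;       // the <page> element currently being read
let tag = null;        // name of the element whose text we are inside
let batch = [];        // alternating action/document entries for the bulk API

xml.on('opentag', (node) => {
  tag = node.name;
  if (node.name === 'page') page = { title: '', text: '' };
});

xml.on('text', (text) => {
  if (!page) return;
  if (tag === 'title') page.title += text;
  if (tag === 'text') page.text += text;
});

xml.on('closetag', (name) => {
  tag = null;
  if (name !== 'page' || !page) return;
  batch.push({ index: { _index: 'wikipedia' } }, page); // action line + document line
  page = null;
  if (batch.length >= BULK_SIZE * 2) {
    const body = batch;
    batch = [];
    client.bulk({ body }).catch(console.error); // one round trip for the whole batch
  }
});

xml.on('end', () => {
  if (batch.length) client.bulk({ body: batch }).catch(console.error); // flush the tail
});

fs.createReadStream('enwiki-20180801-pages-articles-multistream.xml').pipe(xml);
```
Batching 100 documents per request is the point of the bulk API: one HTTP round trip indexes the whole batch instead of one request per page.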
How to import Wikipedia dump into your own Elasticsearch server
- In order to import a Wikipedia dump, you must have an Elasticsearch server running first. Please refer to the Elasticsearch documentation for instructions.
- Download the latest Wikipedia dump from one of the following locations, depending on the language you want:
- Wikipedia in English dump: https://dumps.wikimedia.org/enwiki/
- Wikipedia in German dump: https://dumps.wikimedia.org/dewiki/
- Wikipedia in Polish dump: https://dumps.wikimedia.org/plwiki/
- Wikidata dumps are also available for download: https://dumps.wikimedia.org/wikidatawiki/
- Decompress the downloaded .xml.bz2 file, e.g. enwiki-20180801-pages-articles-multistream.xml.bz2, into a plain .xml file (for example with `bzip2 -d <file>.xml.bz2`).
- Edit the config.js file to point to the Wikipedia dump .xml file and to set the Elasticsearch server connection settings (see the sample configuration under Settings below).
- Run the importer with npm start and watch your Elasticsearch index being populated with raw Wikipedia documents. A quick way to verify the import is sketched right after this list.
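To confirm that documents are actually arriving, a count query against the target index works well. This sketch assumes the v7 `@elastic/elasticsearch` client and an index named `wikipedia`; substitute the values from your config.js.
```js
const { Client } = require('@elastic/elasticsearch');

const client = new Client({ node: 'http://localhost:9200' });

// Print how many Wikipedia documents have been indexed so far.
client.count({ index: 'wikipedia' })
  .then((res) => console.log('Documents indexed:', res.body.count))
  .catch(console.error);
```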
Settings
- You can set a limit on the number of documents per bulk import in config.js; the default is 100.
- Set index, type, host, port, and logFile. If you have enabled the x-pack plugin for Elasticsearch, you can also set the httpAuth setting; otherwise it is ignored. A sample configuration is sketched below.
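For illustration, a config.js along these lines covers the settings above. The property names `dump` and `bulkSize` are assumptions made for this sketch; check the shipped config.js for the exact keys.
```js
module.exports = {
  dump: './enwiki-20180801-pages-articles-multistream.xml', // unzipped dump file (key name assumed)
  bulkSize: 100,            // documents per bulk request (key name assumed; default is 100)
  index: 'wikipedia',       // target Elasticsearch index
  type: 'page',             // document type
  host: 'localhost',        // Elasticsearch host
  port: 9200,               // Elasticsearch port
  logFile: './import.log',  // where import progress is logged
  httpAuth: 'user:password' // only used when the x-pack plugin is enabled
};
```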
Please contribute
Please visit my GitHub to post questions and suggestions or to open pull requests.