wikipedia-elasticsearch-import
Import Wikipedia dumps into Elasticsearch server using streams and bulk indexing for speed.
What does this module do?
Wikipedia regularly publishes dumps of its entire database in every language, which you can download and use for free. This module parses the giant Wikipedia XML dump file as a stream and bulk-imports its contents straight into your Elasticsearch server or cluster.
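The core idea is to read the XML as a stream, collect pages into batches, and send each batch to Elasticsearch in a single bulk request. Below is a minimal sketch of that approach, assuming the `sax` parser and the official `@elastic/elasticsearch` client with its v7-style `bulk({ body })` API; the module's actual dependencies, field handling, and index name may differ.
```js
const fs = require('fs');
const sax = require('sax');
const { Client } = require('@elastic/elasticsearch');

const client = new Client({ node: 'http://localhost:9200' });
const xml = sax.createStream(true); // strict XML parsing

const BULK_SIZE = 100; // documents per bulk request, matching the config.js default
let page = null;       // the <page> element currently being read
let tag = null;        // name of the element whose text we are inside
let batch = [];        // alternating action/document entries for the bulk API

xml.on('opentag', (node) => {
  tag = node.name;
  if (node.name === 'page') page = { title: '', text: '' };
});

xml.on('text', (text) => {
  if (!page) return;
  if (tag === 'title') page.title += text;
  if (tag === 'text') page.text += text;
});

xml.on('closetag', (name) => {
  tag = null;
  if (name !== 'page' || !page) return;
  batch.push({ index: { _index: 'wikipedia' } }, page); // action line + document line
  page = null;
  if (batch.length >= BULK_SIZE * 2) {
    const body = batch;
    batch = [];
    client.bulk({ body }).catch(console.error); // one round trip for the whole batch
  }
});

xml.on('end', () => {
  if (batch.length) client.bulk({ body: batch }).catch(console.error); // flush the tail
});

fs.createReadStream('enwiki-20180801-pages-articles-multistream.xml').pipe(xml);
```
Batching 100 documents per request is the point of the bulk API: one HTTP round trip indexes the whole batch instead of one request per page.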
How to import Wikipedia dump into your own Elasticsearch server
- In order to import a Wikipedia dump, you must have an Elasticsearch server running first. Please refer to the Elasticsearch documentation for instructions.
- Download the latest Wikipedia dump from one of the following locations, depending on the language you want:
- Wikipedia in English dump: https://dumps.wikimedia.org/enwiki/
- Wikipedia in German dump: https://dumps.wikimedia.org/dewiki/
- Wikipedia in Polish dump: https://dumps.wikimedia.org/plwiki/
- Wikidata dumps are also available for download: https://dumps.wikimedia.org/wikidatawiki/
- Decompress the downloaded .xml.bz2 file, e.g. enwiki-20180801-pages-articles-multistream.xml.bz2, into a plain .xml file (for example with `bzip2 -d <file>.xml.bz2`).
- Edit the config.js file to point to the Wikipedia dump .xml file and to set the Elasticsearch server connection settings (see the sample configuration under Settings below).
- Run the importer with npm start and watch your Elasticsearch index being populated with raw Wikipedia documents. A quick way to verify the import is sketched right after this list.
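To confirm that documents are actually arriving, a count query against the target index works well. This sketch assumes the v7 `@elastic/elasticsearch` client and an index named `wikipedia`; substitute the values from your config.js.
```js
const { Client } = require('@elastic/elasticsearch');

const client = new Client({ node: 'http://localhost:9200' });

// Print how many Wikipedia documents have been indexed so far.
client.count({ index: 'wikipedia' })
  .then((res) => console.log('Documents indexed:', res.body.count))
  .catch(console.error);
```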
Settings
- You can set a limit on the number of documents per bulk import in config.js; the default is 100.
- Set index, type, host, port, and logFile. If you have enabled the x-pack plugin for Elasticsearch, you can also set the httpAuth setting; otherwise it is ignored. A sample configuration is sketched below.
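For illustration, a config.js along these lines covers the settings above. The property names `dump` and `bulkSize` are assumptions made for this sketch; check the shipped config.js for the exact keys.
```js
module.exports = {
  dump: './enwiki-20180801-pages-articles-multistream.xml', // unzipped dump file (key name assumed)
  bulkSize: 100,            // documents per bulk request (key name assumed; default is 100)
  index: 'wikipedia',       // target Elasticsearch index
  type: 'page',             // document type
  host: 'localhost',        // Elasticsearch host
  port: 9200,               // Elasticsearch port
  logFile: './import.log',  // where import progress is logged
  httpAuth: 'user:password' // only used when the x-pack plugin is enabled
};
```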
Please contribute
Please visit my GitHub to post questions and suggestions or to open pull requests.