# Wikiviews Importer
A tool for importing excerpts from the [Wikipedia Pageviews](https://dumps.wikimedia.org/other/pageviews/) dataset into Elasticsearch, for use as the data backend of the Wikiviews application.
It automates the process of selecting, downloading and parsing the data and inserting it into Elasticsearch.
## Installation
### via NPM
The project provides an NPM package, which can be installed via
```sh
npm install -g @wikiviews/wikiviews-importer
```
This package is automatically generated for each repository tag via Travis CI.
### Manually
After cloning the repository, you can install the package manually via
```sh
npm install && npm run build && npm install -g
```
## Usage
The tool provides the command-line utility `wv-import` to perform the import.
It optionally accepts a path to a configuration file as its first parameter; additional configuration can be passed via command-line parameters.
### Configuration
The importer can be configured via a configuration file in JSON format and via command-line parameters (in the format `--{PARAMETERNAME}={VALUE}`).
The following configuration options are available:
Option | Corresponding CLI parameter | Description | Default
-------|-----------------------------|-------------|--------
`tasks.download` | `download` | If set to `true`, the selected data dumps are downloaded. If set to `false`, the selection is applied to all already existing files in the destination directory, which are then used as source. | `true`
`tasks.elasticsearch` | `elasticsearch` | If set to `true`, the selected data dumps are inserted into Elasticsearch, otherwise not. | `true`
`download.source` | `source` | The source pattern. It needs to be a URL pointing to the data dumps and containing variables, which are then substituted by the selected date ranges (consecutive appearances of a variable that are longer than the substituted value are padded with leading zeros). Keep in mind that ALL occurrences of the variables are substituted and that the same variables are used in the `download.output` pattern, so choose your variables wisely (or stick to the defaults); see the sketch below the table for an example expansion. | `'https://dumps.wikimedia.org/other/pageviews/bbbb/bbbb-ff/pageviews-bbbbffjj-ll0000.gz'`
`download.output` | `output` | The output filename pattern. It needs to contain the same variables as the `download.source` pattern, which are then substituted by the selected date ranges. | `'bbbb-ff-jj-ll.csv'`
`download.destination` | `destination` | The destination directory the output files are written to. | `./data`
`download.years` | `years` | The selected range of downloaded years. The variable chosen for years (by default `b`) is substituted by the values in this range. It needs to be specified in the format `'{VARIABLE}:{BEGINNING}-{END}'`. | `'b:2016-2016'`
`download.months` | `months` | The selected range of downloaded months in each year. The variable chosen for months (by default `f`) is substituted by the values in this range. It needs to be specified in the format `'{VARIABLE}:{BEGINNING}-{END}'`. | `'f:1-1'`
`download.days` | `days` | The selected range of downloaded days in each month. The variable chosen for days (by default `j`) is substituted by the values in this range. It needs to be specified in the format `'{VARIABLE}:{BEGINNING}-{END}'`. | `'j:1-31'`
`download.hours` | `hours` | The selected range of downloaded hours in each day. The variable chosen for hours (by default `l`) is substituted by the values in this range. It needs to be specified in the format `'{VARIABLE}:{BEGINNING}-{END}'`. | `'l:0-23'`
`download.concurrent` | `concurrentDownloads` | The number of concurrently downloaded dump files. The Wikipedia dump server allows only 3 concurrent downloads per source. | `3`
`download.compression` | `sourceCompression` | The compression algorithm used to decompress the source files. The values `'gz'` (for gzip) and `'zip'` (for zip) are supported; all other values deactivate decompression. The Wikipedia dumps are compressed via gzip. | `'gz'`
`elasticsearch.port` | `esPort` | The port on which the target Elasticsearch instance is listening. | `9200`
`elasticsearch.address` | `esAddress` | The address / domain on which the target Elasticsearch instance is listening. | `'localhost'`
`elasticsearch.index` | `esIndex` | The target index in the Elasticsearch cluster. The default matches if the cluster is set up via `wikiviews-elasticsearch`. | `'wikiviews'`
`elasticsearch.type` | `esType` | The target type in the Elasticsearch cluster. The default matches if the cluster is set up via `wikiviews-elasticsearch`. | `'article'`
`elasticsearch.concurrent` | `concurrentInsertions` | The number of files that are inserted into Elasticsearch concurrently. Either `'all'` or a number of files. | `'all'`
`elasticsearch.batch` | `batchInsert` | The number of dataset rows inserted per batch. Adapt this value to tune memory consumption. | `10000`
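To make the pattern substitution concrete, here is a minimal sketch of how the default date ranges expand the default `download.source` pattern. It is illustrative only and not the importer's actual code:

```ts
// Illustrative sketch of the variable substitution described above —
// not the importer's actual implementation.
function substitute(pattern: string, variable: string, value: number): string {
  // Consecutive occurrences of the variable form one placeholder; the value
  // is padded with leading zeros to the placeholder's length.
  return pattern.replace(new RegExp(`${variable}+`, "g"), (match) =>
    String(value).padStart(match.length, "0")
  );
}

// Default source pattern with the variables b (year), f (month), j (day), l (hour):
let url =
  "https://dumps.wikimedia.org/other/pageviews/bbbb/bbbb-ff/pageviews-bbbbffjj-ll0000.gz";
url = substitute(url, "b", 2016); // first value of 'b:2016-2016'
url = substitute(url, "f", 1);    // first value of 'f:1-1'
url = substitute(url, "j", 1);    // first value of 'j:1-31'
url = substitute(url, "l", 0);    // first value of 'l:0-23'

console.log(url);
// https://dumps.wikimedia.org/other/pageviews/2016/2016-01/pageviews-20160101-000000.gz
```

The default `download.output` pattern `'bbbb-ff-jj-ll.csv'` would expand in the same way, e.g. to `2016-01-01-00.csv` for the first selected hour.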
For example, a configuration file setting the default values would look like this:
```json
{
  "tasks": {
    "download": true,
    "elasticsearch": true
  },
  "download": {
    "source": "https://dumps.wikimedia.org/other/pageviews/bbbb/bbbb-ff/pageviews-bbbbffjj-ll0000.gz",
    "output": "bbbb-ff-jj-ll.csv",
    "destination": "./data",
    "years": "b:2016-2016",
    "months": "f:1-1",
    "days": "j:1-31",
    "hours": "l:0-23",
    "concurrent": 3,
    "compression": "gz"
  },
  "elasticsearch": {
    "port": 9200,
    "address": "localhost",
    "index": "wikiviews",
    "type": "article",
    "concurrent": "all",
    "batch": 10000
  }
}
```
### Running
The tool is run via the `wv-import` command.
Running the command with the configuration file `config.json` and without downloading the source data would look like this:

```sh
wv-import config.json --download=false
```
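After a successful run, the imported documents end up in the configured Elasticsearch index and type. As a quick sanity check, you could count them with an Elasticsearch client; the following sketch assumes the legacy `elasticsearch` Node.js client and the default connection settings from the table above:

```ts
import { Client } from "elasticsearch";

// Connection settings matching the esAddress/esPort defaults above.
const client = new Client({ host: "localhost:9200" });

// Count the documents in the importer's default index and type.
client
  .count({ index: "wikiviews", type: "article" })
  .then((result) => console.log(`Imported documents: ${result.count}`))
  .catch((error) => console.error("Could not reach Elasticsearch:", error));
```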