dumpster-dip v2.0.0

parse a wikipedia dump into tiny files

The data exports from Wikimedia, arguably the world's most important datasets, exist as huge XML files in a notorious markup format.

dumpster-dip can flip this dataset into individual JSON or text files.

Command-Line

The easiest way to get started is to simply run:

npx dumpster-dip

which is a wild, no-install, no-dependency way to get going.

Follow the prompts, and this will download, unzip, and parse any-language Wikipedia into a selected format.

The optional params are:

--lang fr             # do the french wikipedia
--output encyclopedia # add all 'E' pages to ./E/
--text                # return plaintext instead of json
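
For example, assuming these flags can be combined in one call, a French plaintext run could look like:

npx dumpster-dip --lang fr --text --output encyclopedia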

JS API

It is also available as a powerful JavaScript library:

npm install dumpster-dip

import dumpster from 'dumpster-dip' // or require('dumpster-dip')

await dumpster({ file: './enwiki-latest-pages-articles.xml' }) // 😅

This requires you to download and unzip a dump yourself (instructions below). Depending on the language, parsing may take a couple of hours.

Instructions

1. Download a dump

Cruise the Wikipedia dump page and look for ${LANG}wiki-latest-pages-articles.xml.bz2

2. Unzip it

bzip2 -d ./enwiki-latest-pages-articles.xml.bz2

3. Point dumpster-dip at it

import dip from 'dumpster-dip'

const opts = {
  input: './enwiki-latest-pages-articles.xml',
  parse: function (doc) {
    return doc.sentences()[0].text() // return the first sentence of each page
  }
}

dip(opts).then(() => {
  console.log('done!')
})

The English Wikipedia takes about 4 hours on a MacBook. See expected article counts here.

Options

{
  file: './enwiki-latest-pages-articles.xml', // path to unzipped dump file relative to cwd
  outputDir: './dip', // directory for all our new file(s)
  outputMode: 'nested', // how we should write the results

  // define how many concurrent workers to run
  workers: cpuCount, // default is cpu count
  // interval to log status
  heartbeat: 5000, // every 5 seconds

  // which wikipedia namespaces to handle (null will do all)
  namespace: 0, // (default article namespace)
  // parse redirects, too
  redirects: false,
  // parse disambiguation pages, too
  disambiguation: true,

  // allow a custom wtf_wikipedia parsing library
  libPath: 'wtf_wikipedia',

  // should we skip this page or return something?
  doPage: function (doc) {
    return true
  },

  // what to return, for every page
  //- avoid using an arrow-function
  parse: function (doc) {
    return doc.json()
  }
}
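
You don't need to set all of these. As a rough sketch (assuming an unzipped English dump sits at the path shown, and passing the path as file, as in the listing above), a call overriding only a few defaults might look like:

import dip from 'dumpster-dip'

await dip({
  file: './enwiki-latest-pages-articles.xml',
  outputDir: './dip',
  workers: 2, // only run two concurrent workers
  heartbeat: 10000, // log status every 10 seconds
  parse: function (doc) {
    return doc.json() // write each page as json
  }
})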

Output formats:

dumpster-dip comes with 5 output formats:

  • 'flat' - all files in 1 directory
  • 'encyclopedia' - all 'E..' pages in ./e
  • 'encyclopedia-two' - all 'Ed..' pages in ./ed
  • 'hash' (default) - 2 evenly-distributed directories
  • 'ndjson' - all data in one file

Sometimes operating systems don't like having ~6 million files in one folder, so these options allow different nesting structures:

Encyclopedia

To put files in folders indexed by their first letter, do:

let opts = {
  outputDir: './results',
  outputMode: 'encyclopedia'
}

Remember, some directories become way larger than others. Also remember that titles are UTF-8.

For two-letter folders, use outputMode: 'encyclopedia-two'

Hash (default)

This format nests each file 2-deep, using the first 4 characters of the filename's hash:

/BE
  /EF
    /Dennis_Rodman.txt
    /Hilary_Clinton.txt

Although these directory names are meaningless, the advantage of this format is that files are distributed evenly instead of piling up in the 'E' directory.

This is the same scheme that Wikipedia uses internally.

As a helper, this library exposes a function for navigating this directory scheme:

import getPath from 'dumpster-dip/nested-path'
let file = getPath('Dennis Rodman')
// ./BE/EF/Dennis_Rodman.txt
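
For example, a small sketch that reads one result back after a run, assuming plaintext output under the default './dip' directory:

import { readFile } from 'fs/promises'
import path from 'path'
import getPath from 'dumpster-dip/nested-path'

// resolve the nested location of a page, then read the parsed file back
let rel = getPath('Dennis Rodman') // ./BE/EF/Dennis_Rodman.txt
let text = await readFile(path.join('./dip', rel), 'utf8')
console.log(text)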

Flat

If you want all files in one flat directory, you can cross your fingers and do:

let opts = {
  outputDir: './results',
  outputMode: 'flat'
}

Ndjson

You may want all results in one newline-delimited file. Using this format, you can produce TSV or CSV files:

let opts = {
  outputDir: './results',
  outputMode: 'ndjson',
  parse: function (doc) {
    return [doc.title(), doc.text().length].join('\t')
  }
}
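
The resulting newline-delimited file can then be read back line by line; a sketch of that (the output filename here is only a placeholder, check your outputDir for the real one):

import { createReadStream } from 'fs'
import { createInterface } from 'readline'

// stream the tab-separated results back, one line at a time
// note: 'results.ndjson' is a placeholder name - check ./results for the actual file
const rl = createInterface({ input: createReadStream('./results/results.ndjson') })
for await (const line of rl) {
  const [title, length] = line.split('\t')
  console.log(title, Number(length))
}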

Examples:

Wikipedia is a complicated place. Getting specific data may require some investigation and experimentation:

See runnable examples in ./examples

Birthdays of basketball players

Process only the 13,000 pages with the category American men's basketball players

await dip({
  input: `./enwiki-latest-pages-articles.xml`,
  doPage: function (doc) {
    return doc.categories().find((cat) => cat === `American men's basketball players`)
  },
  parse: function (doc) {
    return doc.infobox().get('birth_date')
  }
})

Film Budgets

Look for pages with the Film infobox and grab some properties:

await dip({
  input: `./enwiki-latest-pages-articles.xml`,
  outputMode: 'encyclopedia',
  doPage: function (doc) {
    // look for anything with a 'Film' infobox
    return doc.infobox() && doc.infobox().type() === 'film'
  },
  parse: function (doc) {
    let inf = doc.infobox()
    // pluck some values from its infobox
    return {
      title: doc.title(),
      budget: inf.get('budget'),
      gross: inf.get('gross')
    }
  }
})

Talk Pages

Talk pages are not found in the normal 'latest-pages-articles.xml' dump. Instead, you must download the larger 'latest-pages-meta-current.xml' dump. To process only Talk pages, set 'namespace' to 1.

import dip from 'dumpster-dip'

const opts = {
  input: `./enwiki-latest-pages-meta-current.xml`,
  namespace: 1, // do talk pages only
  parse: function (doc) {
    return doc.text() // return their text
  }
}
await dip(opts)

Customization

Given the parse callback, you're free to return anything you'd like.
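
For instance, a quick sketch that writes a small structured record per page (the field names here are just illustrative):

import dip from 'dumpster-dip'

await dip({
  input: './enwiki-latest-pages-articles.xml',
  parse: function (doc) {
    // return whatever shape you want written for each page
    return {
      title: doc.title(),
      categories: doc.categories(),
      sentenceCount: doc.sentences().length
    }
  }
})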

One of the charms of wtf_wikipedia is its plugin system, which lets users add new features.

Here we apply a custom plugin to our wtf lib and pass it in, so it is available to each worker:

In ./myLib.js:

import wtf from 'wtf_wikipedia'

// add custom analysis as a plugin
wtf.plugin((models, templates) => {
  // add a new method
  models.Doc.prototype.firstSentence = function () {
    return this.sentences()[0].text()
  }
  // support a missing plugin
  templates.pingponggame = function (tmpl, list) {
    let arr = tmpl.split('|')
    return arr[1] + ' to ' + arr[2]
  }
})
export default wtf

Then we can pass this version into dumpster-dip:

import dip from 'dumpster-dip'

dip({
  input: '/path/to/dump.xml',
  libPath: './myLib.js', // our version (relative to cwd)
  parse: function (doc) {
    return doc.firstSentence() // use custom method
  }
})

See the plugins available, such as the NHL season parser, the nsfw tagger, or a parser for disambiguation pages.


👋

We are committed to making this library a great tool for parsing MediaWiki projects.

PRs welcomed and respected.

MIT