mhdscraper

v1.0.3

Mediawiki history dumps scraper, a module that scrapes the site and returns to you the available content.

mediawiki-history-dumps-scraper

This is the npm module of "Mediawiki history dumps scraper"; refer to the main branch for the project's general purpose.

What does the module do?

This npm module lets you retrieve, optionally selectively, the content available in the MediaWiki history dumps through a scraper. You can check which versions, languages and datasets are available, along with the download links, the file sizes and more.

How was it made?

This module is written in TypeScript and uses axios and regular expressions to scrape the content from the download site. The code is linted with ESLint and Prettier and bundled with webpack. Code documentation generated with TypeDoc and hosted on Vercel is available at https://mhdscraper.euber.dev.
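
To illustrate the regular-expression approach mentioned above, here is a minimal sketch that pulls year-month directory names out of an HTML listing. The sample HTML, the pattern and the function name `extractVersions` are assumptions made for this illustration; they are not taken from the module's source code.

```typescript
// Hypothetical HTML fragment, shaped like a simple directory listing page.
const sampleHtml = `
<a href="../">../</a>
<a href="2021-05/">2021-05/</a>
<a href="2021-06/">2021-06/</a>
`;

// Collects the link targets that look like year-month directories ("2021-06/").
function extractVersions(html: string): string[] {
    // The regex is created on each call so that its lastIndex starts at 0.
    const pattern = /<a href="(\d{4}-\d{2})\/">/g;
    const versions: string[] = [];
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(html)) !== null) {
        versions.push(match[1]);
    }
    return versions;
}
```

Calling `extractVersions(sampleHtml)` skips the parent-directory link and keeps only the two year-month entries.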

How to use it?

Installation

npm install mhdscraper

Examples

An example (add a console.log of any variable to inspect the response):

import * as mhdscraper from 'mhdscraper';

async function main() {
    // Returns the root url of the datasets site
    const root_url = mhdscraper.WIKI_URL;

    // Returns an array of versions, returning the version name and its url
    const versions = await mhdscraper.fetchVersions();
    // Returns an array of versions, returning the version name, its url and 
    // including all the available wikies (name and url)
    const versionsWithLangs = await mhdscraper.fetchVersions({
        wikies: true
    });

    // Returns an array containing all the wikies of the latest version,
    // returning name and url
    const wikies = await mhdscraper.fetchWikies('latest');
    // Returns an array containing the wikies ending with 'wiki' of the 
    // latest version, returning name and url
    const wikiesEndingWithWiki = await mhdscraper.fetchWikies('latest', {
        wikitype: 'wiki'
    });
    // Returns an array containing the wikies starting with 'it' of the latest version, 
    // returning name, url and the array of available dumps
    const wikiesWithDumps = await mhdscraper.fetchWikies('latest', {
        lang: 'it',
        dumps: true
    });

    // Returns an array containing all the dumps of 'itwiki' of the latest version, 
    // returning many pieces of information such as filename, start and end date 
    // of the content, size in bytes, url to download it...
    const dumps = await mhdscraper.fetchDumps('latest', 'itwiki');
    // Returns an array containing all the dumps of 'itwiki' of the latest version,
    // whose content is between 2004-01-01 and 2005-02-01
    const dumpsSelected = await mhdscraper.fetchDumps('latest', 'itwiki', {
        start: new Date('2004-01-01'),
        end: new Date('2005-02-01')
    });

}
main();

The result of:

import * as mhdscraper from 'mhdscraper';

async function main() {
    const result = await mhdscraper.fetchWikies('latest', {
        lang: 'it',
        wikitype: 'wiki',
        dumps: true,
        start: new Date('2010-01-01'),
        end: new Date('2012-12-31'),
    });
}
main();

Would be (as of July 2021):

[
    {
        "dumps": [
            {
                "bytes": "691419132",
                "filename": "2021-06.itwiki.2010.tsv.bz2",
                "from": "2010-01-01",
                "lastUpdate": "2021-07-03T10:38:00",
                "time": "2010",
                "to": "2010-12-31",
                "url": "https://dumps.wikimedia.org/other/mediawiki_history/2021-06/itwiki/2021-06.itwiki.2010.tsv.bz2"
            },
            {
                "bytes": "706208269",
                "filename": "2021-06.itwiki.2011.tsv.bz2",
                "from": "2011-01-01",
                "lastUpdate": "2021-07-03T10:57:00",
                "time": "2011",
                "to": "2011-12-31",
                "url": "https://dumps.wikimedia.org/other/mediawiki_history/2021-06/itwiki/2021-06.itwiki.2011.tsv.bz2"
            },
            {
                "bytes": "747376403",
                "filename": "2021-06.itwiki.2012.tsv.bz2",
                "from": "2012-01-01",
                "lastUpdate": "2021-07-03T10:11:00",
                "time": "2012",
                "to": "2012-12-31",
                "url": "https://dumps.wikimedia.org/other/mediawiki_history/2021-06/itwiki/2021-06.itwiki.2012.tsv.bz2"
            }
        ],
        "url": "https://dumps.wikimedia.org/other/mediawiki_history/2021-06/itwiki",
        "wiki": "itwiki"
    }
]
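
As a small follow-up to the output above, the sketch below sums the `bytes` fields of the returned dumps to estimate the total download size. The interface `DumpSize` and the helpers `totalBytes` and `formatGiB` are hypothetical names introduced here. Because the sample output shows `bytes` as a string while the API section describes it as an int, the sketch accepts both.

```typescript
// Only the fields needed for this illustration (hypothetical type name).
interface DumpSize {
    filename: string;
    bytes: string | number;
}

// Values copied from the example output above.
const exampleDumps: DumpSize[] = [
    { filename: '2021-06.itwiki.2010.tsv.bz2', bytes: '691419132' },
    { filename: '2021-06.itwiki.2011.tsv.bz2', bytes: '706208269' },
    { filename: '2021-06.itwiki.2012.tsv.bz2', bytes: '747376403' },
];

// Sums the sizes, converting string values to numbers.
function totalBytes(dumps: DumpSize[]): number {
    return dumps.reduce((sum, dump) => sum + Number(dump.bytes), 0);
}

// Formats a byte count as gibibytes with two decimals.
function formatGiB(bytes: number): string {
    return (bytes / Math.pow(1024, 3)).toFixed(2) + ' GiB';
}
```

For the three files above, `totalBytes(exampleDumps)` is 2145003804 bytes, which `formatGiB` renders as roughly 2 GiB.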

API

WIKI_URL

A constant containing the URL of the root of the datasets site.

async fetchLatestVersion(options)

Fetches the latest version of the MediaWiki history dumps.

The version is the year-month of the release of the dumps.

Options' fields:

  • wikies (boolean, default=false): Whether to fetch the wikies for each returned version.
  • lang (string, default=undefined): If the wikies argument is true, the language of the wikies to return (a wiki name starts with the language).
  • wikitype (string, default=undefined): If the wikies argument is true, the wiki type of the wikies to return (a wiki name ends with the wiki type).
  • dumps (boolean, default=false): If the wikies argument is true, whether to fetch the dumps for each returned wiki.
  • start (Date, default=undefined): If the wikies and dumps arguments are true, retrieve only the dumps newer than this date.
  • end (Date, default=undefined): If the wikies and dumps arguments are true, retrieve only the dumps older than this date.

Returns an object with:

  • version (string) for the version year-month
  • url (string) for the url of that version.
  • wikies, containing the fetched wikies if the wikies argument was set to true.

If no version is found, None is returned.
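
Since a version is a zero-padded year-month string, the latest one can be found with a plain string comparison. The sketch below shows that idea in isolation. The function name `pickLatestVersion` and the choice to return `undefined` for an empty list are assumptions of this sketch, not the module's actual behaviour.

```typescript
// Picks the most recent "YYYY-MM" string from a list of version names.
function pickLatestVersion(versions: string[]): string | undefined {
    // Ignore anything that is not a zero-padded year-month.
    const valid = versions.filter((v) => /^\d{4}-\d{2}$/.test(v));
    if (valid.length === 0) {
        return undefined;
    }
    // Zero-padded year-month strings sort chronologically as plain strings.
    return valid.reduce((latest, v) => (v > latest ? v : latest));
}
```

For example, `pickLatestVersion(['2021-04', '2021-06', '2020-12'])` picks `'2021-06'`.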

fetchVersions(options)

Fetches the versions of the MediaWiki history dumps.

The versions are the year-month of the release of the dumps.

Options' fields:

  • wikies (boolean, default=false): Whether to fetch the wikies for each returned version.
  • lang (string, default=undefined): If the wikies argument is true, the language of the wikies to return (a wiki name starts with the language).
  • wikitype (string, default=undefined): If the wikies argument is true, the wiki type of the wikies to return (a wiki name ends with the wiki type).
  • dumps (boolean, default=false): If the wikies argument is true, whether to fetch the dumps for each returned wiki.
  • start (Date, default=undefined): If the wikies and dumps arguments are true, retrieve only the dumps newer than this date.
  • end (Date, default=undefined): If the wikies and dumps arguments are true, retrieve only the dumps older than this date.

Returns an array of objects with:

  • version (string) for the version year-month
  • url (string) for the url of that version.
  • wikies, containing the fetched wikies if the wikies argument was set to true (see fetchWikies for the shape of the result).
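
The nested result described above (versions containing wikies, wikies containing dumps) can be walked with a few loops. The following is a minimal sketch under stated assumptions: the interface names `VersionEntry`, `WikiEntry` and `DumpEntry` and the helper `listDumpUrls` are hypothetical, and only the fields documented in this README are modelled.

```typescript
// Hypothetical type names; the fields follow the descriptions in this README.
interface DumpEntry {
    filename: string;
    url: string;
}
interface WikiEntry {
    wiki: string;
    url: string;
    dumps?: DumpEntry[]; // present only when dumps were requested
}
interface VersionEntry {
    version: string;
    url: string;
    wikies?: WikiEntry[]; // present only when wikies were requested
}

// Flattens the nested structure into a list of download URLs.
function listDumpUrls(versions: VersionEntry[]): string[] {
    const urls: string[] = [];
    for (const version of versions) {
        for (const wiki of version.wikies ?? []) {
            for (const dump of wiki.dumps ?? []) {
                urls.push(dump.url);
            }
        }
    }
    return urls;
}
```

Versions or wikies without the optional arrays simply contribute no URLs, so the same helper works whether or not the wikies and dumps options were set.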

fetchWikies(version, options)

Fetches the wikies of a version of the MediaWiki history dumps.

Parameters:

  • version (string): The version whose wikies will be returned. If "latest" is passed, the latest version is retrieved.

Options' fields:

  • lang (string, default=undefined): The language of the wikies to return (a wiki name starts with the language).
  • wikitype (string, default=undefined): The wiki type of the wikies to return (a wiki name ends with the wiki type).
  • dumps (boolean, default=false): Whether to fetch the dumps for each returned wiki.
  • start (Date, default=undefined): If the dumps argument is true, retrieve only the dumps newer than this date
  • end (Date, default=undefined): If the dumps argument is true, retrieve only the dumps older than this date

Returns an array of objects with:

  • wiki (string) for the wiki name
  • url (string) for the url of that wiki.

In addition, if the dumps argument is true, a dumps (array) field contains the fetched dumps (see fetchDumps for the shape of the result).
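
The lang and wikitype options rely on the naming convention stated above: a wiki name starts with the language and ends with the wiki type. The sketch below applies that convention to a list of names. The names `WikiFilter` and `matchesWiki` are hypothetical, and this illustrates the convention only, not the module's internal implementation.

```typescript
// Hypothetical filter shape mirroring the lang and wikitype options.
interface WikiFilter {
    lang?: string;
    wikitype?: string;
}

// Checks a wiki name against the prefix (language) and suffix (wiki type).
function matchesWiki(name: string, filter: WikiFilter = {}): boolean {
    const langOk = filter.lang === undefined || name.startsWith(filter.lang);
    const typeOk = filter.wikitype === undefined || name.endsWith(filter.wikitype);
    return langOk && typeOk;
}

// Example wiki names used only for this illustration.
const wikiNames = ['itwiki', 'itwikibooks', 'enwiki', 'itwiktionary'];
const italianWikis = wikiNames.filter((name) =>
    matchesWiki(name, { lang: 'it', wikitype: 'wiki' })
);
```

Here `italianWikis` contains only `'itwiki'`, because `'itwikibooks'` and `'itwiktionary'` do not end with `'wiki'` and `'enwiki'` does not start with `'it'`.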

fetchDumps(version, wiki, options)

Fetches the dumps of a wiki of the MediaWiki history dumps.

Parameters:

  • version (string): The version of the wiki
  • wiki (string): The wiki whose dumps will be returned

Options' fields:

  • start (Date, default=undefined): Retrieve only the dumps newer than this date
  • end (Date, default=undefined): Retrieve only the dumps older than this date

Returns an array of objects with:

  • filename (string) for dump file name
  • time (string) for the time of the data ('all-time', year or year-month)
  • lastUpdate (Datetime) for the last update date
  • bytes (int) for the size in bytes of the file
  • from (Date) for the start date of the data
  • to (Date) for the end date of the data
  • url (string) for the url of the file
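
To make the start and end options more concrete, here is a minimal sketch of a date-range filter over dump entries. It assumes inclusive bounds, which is consistent with the earlier example where a start of 2010-01-01 and an end of 2012-12-31 returned the 2010, 2011 and 2012 files. The names `DumpRange` and `filterByRange` are hypothetical, and the 2009 and 2013 entries are made-up sample data.

```typescript
// Only the fields needed for range filtering (hypothetical type name).
interface DumpRange {
    filename: string;
    from: string; // start date of the data, e.g. "2010-01-01"
    to: string;   // end date of the data, e.g. "2010-12-31"
}

// Keeps dumps whose data lies entirely within [start, end], bounds inclusive.
function filterByRange(dumps: DumpRange[], start?: Date, end?: Date): DumpRange[] {
    return dumps.filter((dump) => {
        const from = new Date(dump.from).getTime();
        const to = new Date(dump.to).getTime();
        const afterStart = start === undefined || from >= start.getTime();
        const beforeEnd = end === undefined || to <= end.getTime();
        return afterStart && beforeEnd;
    });
}

// Sample entries; the 2009 and 2013 rows are invented for this illustration.
const yearlyDumps: DumpRange[] = [
    { filename: '2021-06.itwiki.2009.tsv.bz2', from: '2009-01-01', to: '2009-12-31' },
    { filename: '2021-06.itwiki.2010.tsv.bz2', from: '2010-01-01', to: '2010-12-31' },
    { filename: '2021-06.itwiki.2011.tsv.bz2', from: '2011-01-01', to: '2011-12-31' },
    { filename: '2021-06.itwiki.2012.tsv.bz2', from: '2012-01-01', to: '2012-12-31' },
    { filename: '2021-06.itwiki.2013.tsv.bz2', from: '2013-01-01', to: '2013-12-31' },
];

const selectedDumps = filterByRange(
    yearlyDumps,
    new Date('2010-01-01'),
    new Date('2012-12-31')
);
```

With these inputs, `selectedDumps` holds the 2010, 2011 and 2012 entries. Omitting either bound leaves that side of the range open.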