mwoffliner
v1.13.0
Mediawiki ZIM scraper
MWoffliner
MWoffliner is a tool for making a local offline HTML snapshot of any online MediaWiki instance. It goes through all online articles (or a selection if specified) and creates the corresponding ZIM file. It has mainly been tested against Wikimedia projects like Wikipedia and Wiktionary, but it should also work for any recent MediaWiki.
Read CONTRIBUTING.md to know more about MWoffliner development.
Features
- Scrape with or without image thumbnail
- Scrape with or without audio/video multimedia content
- S3 cache (optional)
- Image size optimiser / Webp converter
- Scrape all articles in the selected namespaces, or only the articles given in a title list
- Specify additional/non-main namespaces to scrape
Run mwoffliner --help to get all the possible options.
Prerequisites
- *NIX Operating System (GNU/Linux, macOS, ...)
- Redis
- NodeJS version 16 or greater
- Libzim (On GNU/Linux & macOS we automatically download it)
- Various build tools which are probably already installed on your machine (packages libjpeg-dev, libglu1, autoconf, automake, gcc on Debian/Ubuntu)
... and an online MediaWiki with its API available.
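The NodeJS requirement above can be checked before installing. The following is a minimal sketch, not part of MWoffliner itself: the helper name isSupportedNode and the constant MIN_NODE_MAJOR are hypothetical, and the only fact taken from this README is the minimum major version of 16.

```javascript
// Minimum major version, taken from the prerequisites list above.
const MIN_NODE_MAJOR = 16;

// Hypothetical helper (not part of MWoffliner): returns true when a version
// string such as "18.17.1" (the format of process.versions.node) has a major
// version of at least minMajor.
function isSupportedNode(version, minMajor = MIN_NODE_MAJOR) {
  const major = Number.parseInt(String(version).split(".")[0], 10);
  return Number.isInteger(major) && major >= minMajor;
}

// Warn when the running interpreter is too old.
if (!isSupportedNode(process.versions.node)) {
  console.error(
    `Node.js ${process.versions.node} found, ${MIN_NODE_MAJOR} or greater is required.`
  );
}
```

Running this file with node prints a message only when the interpreter is older than the required version.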
Usage
To install MWoffliner globally:
npm i -g mwoffliner
You might need to run this command with sudo, depending on how your npm is configured. npm permission checking can be a bit annoying for a newcomer. Please read the documentation carefully if you hit problems: https://docs.npmjs.com/cli/v7/using-npm/scripts#user
Then to run it:
mwoffliner --help
To install and run it locally:
npm i
npm run mwoffliner -- --help
To use MWoffliner with an S3 cache, you should provide an S3 URL like this:
--optimisationCacheUrl="https://wasabisys.com/?bucketName=my-bucket&keyId=my-key-id&secretAccessKey=my-sac"
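If the bucket name or keys come from configuration, the URL can be assembled programmatically. This is a rough sketch under assumptions: buildCacheUrl is a hypothetical helper, not part of MWoffliner, and the only details taken from this README are the three query parameter names shown above. Values containing special characters are percent-encoded by the standard URL API; whether MWoffliner decodes them as expected is not confirmed here.

```javascript
// Hypothetical helper (not part of MWoffliner): builds the value passed to
// --optimisationCacheUrl from an endpoint and the three query parameters
// shown in the example above.
function buildCacheUrl(endpoint, { bucketName, keyId, secretAccessKey }) {
  const url = new URL(endpoint);
  url.searchParams.set("bucketName", bucketName);
  url.searchParams.set("keyId", keyId);
  url.searchParams.set("secretAccessKey", secretAccessKey);
  return url.toString();
}

// Placeholder values copied from the example above.
const cacheUrl = buildCacheUrl("https://wasabisys.com/", {
  bucketName: "my-bucket",
  keyId: "my-key-id",
  secretAccessKey: "my-sac",
});

// Print the flag in the form used on the command line.
console.log(`--optimisationCacheUrl="${cacheUrl}"`);
```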
API
MWoffliner also provides an API and can therefore be used as a NodeJS library. Here is a stub example:
const mwoffliner = require('mwoffliner');
const parameters = {
  mwUrl: "https://es.wikipedia.org",
  adminEmail: "[email protected]",
  verbose: true,
  format: "nopic",
  articleList: "./articleList"
};
mwoffliner.execute(parameters); // returns a Promise
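Since execute returns a Promise, a caller will usually want to handle both completion and failure. The sketch below is one possible way to do that; runScrape is a hypothetical wrapper, and the execute function is passed in as an argument so that the sketch assumes nothing about MWoffliner beyond "execute(parameters) returns a Promise". What the Promise resolves to is not documented here, so the result is passed through untouched.

```javascript
// Hypothetical wrapper (not part of MWoffliner): awaits the Promise returned
// by `execute` and turns success or failure into a plain outcome object.
async function runScrape(execute, parameters) {
  try {
    const result = await execute(parameters);
    return { ok: true, result };
  } catch (error) {
    return { ok: false, error };
  }
}

// Usage with the real library (shown as comments, not run here):
// const mwoffliner = require("mwoffliner");
// runScrape((p) => mwoffliner.execute(p), parameters).then((outcome) => {
//   if (!outcome.ok) {
//     console.error("Scrape failed:", outcome.error);
//     process.exitCode = 1;
//   }
// });
```

Wrapping the call this way keeps a rejected Promise from surfacing as an unhandled rejection in the calling script.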
Background
Complementary information about MWoffliner:
- MediaWiki software is used by thousands of wikis, the most famous ones being the Wikimedia ones, including Wikipedia.
- MediaWiki is a PHP wiki runtime engine.
- Wikitext is the name of the markup language that MediaWiki uses.
- MediaWiki includes a parser that converts Wikitext into HTML, and this parser creates the HTML pages displayed in your browser.
GNU/Linux - Debian based distributions
Install NodeJS: Read https://nodejs.org/en/download/current/
Install Redis:
sudo apt-get install redis-server
Troubleshooting
Older GNU/Linux distributions and/or versions of Node.js might be shipped with a deprecated version of npm. Older versions of npm have incompatibilities with certain versions of Node.js and might simply fail to install the mwoffliner package.
We recommend using a recent version of npm. Recent versions work perfectly well with older Node.js 10. First install the packaged version of npm, then use it to install a newer version like this:
sudo npm install --unsafe-perm -g npm
Don't forget to remove the packaged version of npm afterward.