wikipedia-list-extractor

v0.4.1

Read entries from Wikipedia lists

Wikipedia has lists of objects (e.g. monuments), often referenced by governmental data (e.g. heritage protection). This module helps to extract data from these lists.

Example: The sub-pages of [Denkmalgeschützte Objekte in Österreich](https://de.wikipedia.org/wiki/Denkmalgesch%C3%BCtzte_Objekte_in_%C3%96sterreich) list all heritage-protected objects in Austria. This module returns the individual items of this list as JSON objects. Within this module, the ID of this list is 'AT-BDA'. Items can be referenced by their ID (e.g. 'id-24536'), their Wikidata ID (e.g. 'Q1534177'), or their page plus index (e.g. 'Liste der denkmalgeschützten Objekte in Wien/Innere Stadt/E–He#69').
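The three reference styles can be told apart by their shape. The following is only an illustrative sketch and not part of the module: `classifyReference` is a hypothetical helper, and it assumes that a Wikidata ID is `Q` followed by digits and that a page reference ends in `#` plus a numeric index.

```javascript
// Hypothetical helper (not part of wikipedia-list-extractor): guess which
// kind of reference a string is, based only on the examples above.
function classifyReference (ref) {
  // Wikidata item IDs are 'Q' followed by digits, e.g. 'Q1534177'
  if (/^Q[0-9]+$/.test(ref)) {
    return { type: 'wikidata', value: ref }
  }

  // Page plus index: '<page title>#<index>'; this splits at the last '#'
  const m = ref.match(/^(.+)#([0-9]+)$/)
  if (m) {
    return { type: 'page', page: m[1], index: parseInt(m[2], 10) }
  }

  // Anything else is treated as a list-specific ID, e.g. 'id-24536'
  return { type: 'id', value: ref }
}
```

The module itself may resolve references differently; the sketch only makes the three formats explicit.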

There's a demo-application where you can view items on a map: https://openstreetmap.at/demo-wikipedia-list-extractor (Source).

In data/ there are config files for each type of list.

Usage

Stand-alone with NodeJS server (included with the dev dependencies)

git clone https://github.com/plepe/wikipedia-list-extractor
cd wikipedia-list-extractor
npm install
npm start

Point your browser to http://localhost:8080/ for the interactive App.

You can try the list 'AT-BDA' and as ID 'id-24536' or 'Q1534177'. Both IDs should return the Goethedenkmal in Vienna.

Additionally, the stand-alone server exposes an HTTP API which you can query: http://localhost:8080/api/<list>/<id>

  • where list is the ID of a list (e.g. INT-UNESCO)
  • where id is one or several ids, comma separated

Example:

curl http://localhost:8080/api/INT-UNESCO-de/91,80
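The same request can be made from JavaScript. This is a minimal sketch under assumptions that this README does not state: Node.js 18 or later (for the global `fetch`), a JSON response body, and a server that accepts URL-encoded IDs. `buildApiUrl` and `queryApi` are hypothetical helper names.

```javascript
// Build the URL for the stand-alone server's HTTP API.
// baseUrl is wherever `npm start` is serving, e.g. 'http://localhost:8080'.
function buildApiUrl (baseUrl, list, ids) {
  // Encode each ID on its own, then join with commas as in the curl example
  const idPart = ids.map(id => encodeURIComponent(id)).join(',')
  return `${baseUrl}/api/${encodeURIComponent(list)}/${idPart}`
}

// Query the API. Assumes a global fetch (Node.js 18+) and a JSON response.
async function queryApi (baseUrl, list, ids) {
  const response = await fetch(buildApiUrl(baseUrl, list, ids))
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`)
  }
  return response.json()
}
```

Calling `queryApi('http://localhost:8080', 'INT-UNESCO-de', ['91', '80'])` requests the same URL as the curl command above.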

As module within a NodeJS application

Wikipedia List Extractor uses a few modules (node-fetch, jsdom) as indirect dependencies (so they don't get compiled when using browserify). These have to be exposed as global variables. This can be done by requiring wikipedia-list-extractor/node.

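// Assumption: the module exported by wikipedia-list-extractor/node is the MediawikiListExtractor class
const MediawikiListExtractor = require('wikipedia-list-extractor/node')

const extractor = new MediawikiListExtractor('INT-UNESCO-de', null, {
  // directory that contains the list definition files
  path: 'node_modules/wikipedia-list-extractor/data'
})
// fetch the items with the IDs 91 and 80
extractor.get(['91', '80'], (err, result) => {
  console.log(err, JSON.stringify(result, null, '  '))
})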
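If you prefer promises over callbacks, the `get()` call can be wrapped with Node's `util.promisify`. This sketch relies only on the `(err, result)` callback shape shown above; `getItems` is a hypothetical helper, and `fakeExtractor` is a stand-in so the example runs without network access.

```javascript
const { promisify } = require('util')

// Wrap extractor.get() so that it returns a promise.
function getItems (extractor, ids) {
  return promisify(extractor.get.bind(extractor))(ids)
}

// Stand-in object with the same callback shape as the example above;
// replace it with a real MediawikiListExtractor instance.
const fakeExtractor = {
  get (ids, callback) {
    callback(null, ids.map(id => ({ id })))
  }
}

getItems(fakeExtractor, ['91', '80']).then(result => {
  console.log(JSON.stringify(result, null, '  '))
})
```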

Stand-alone on a PHP server (e.g. with Apache2)

cd /var/www/html
git clone https://github.com/plepe/wikipedia-list-extractor
cd wikipedia-list-extractor
npm install

Point your browser to https://server/wikipedia-list-extractor

You have to select 'Run code in browser', as the PHP code does not implement the server side.

As module within a web application in a browser

Wikipedia does not allow requests from a web browser unless they originate from a Wikipedia page, so a proxy is required. Supply the URL of the proxy in the options when creating the MediawikiListExtractor:

// def is the file data/INT-UNESCO.json as Javascript Object
let extractor = new MediawikiListExtractor('INT-UNESCO', null, {
  path: 'node_modules/wikipedia-list-extractor/data',
  proxy: 'proxy/?'
})
extractor.get(['91', '80'], (err, result) => {
  console.log(err, result)
})

See proxy/index.php or proxy/index.js for examples.

List definition files

The list definition files are YAML files in the data/ folder. The basic structure:

title:
  en: List for something
param:
  ... Definition for a source or several sources

Definition of a source:

language: de
source: https://de.wikipedia.org
pageTitleMatch: Liste der Kunstwerke
renderedFields:
  id:
    column: 2
    regexp: /<a[^>]*>([0-9]+)<\/a>/
    type: html
  wikidata:
    column: 3
    regexp: /<a href="https:\/\/www.wikidata.org\/wiki\/(Q[0-9]+)">Wikidata<\/a>/
    type: html

For sources, the following options are possible:

| Field | Description |
| ----- | ------------------------------------------ |
| language | Language of this list |
| source | URL of the Mediawiki / Wikipedia where this list is to be found |
| pageTitleMatch | The template title for pages which build this page (e.g. there might be a list of artwork for each town). This is a regular expression for Mediawiki CirrusSearch, so there might be some restrictions. |
| template | Mediawiki pages use the specified template (or, when this is an array, templates) for rendering content. |
| rawIdField | The id of the item can be read from this field (in the template in page source). |
| rawAnchorField | The HTML anchor of the item can be read from this field (in the template in page source). |
| rawWikidataField | The wikidata id of the item can be read from this field (in the template in page source). |
| renderedTableClass | In rendered output, the table in the page can be detected from this class. |
| renderedIdField | The id of the item can be read from this field (in the rendered output, see renderedFields). If the id is empty ('', null, ...), the item will be ignored. |
| renderedAnchorField | The HTML anchor of the item can be read from this field (in the rendered output, see renderedFields). |
| renderedWikidataField | The wikidata id of the item can be read from this field (in the rendered output, see renderedFields). |
| renderedFields | Hashed array of fields, see below. |
| wikidataFields | Optionally load the specified list of fields from the matching wikidata item. Example: [{property: P31, field: "is_a"}, ...] |

Advanced Fields:

| Field | Description |
| ----- | ------------------------------------------ |
| pages | List of pages which constitutes the whole dataset (e.g. for getAll, which returns all items). |
| rawAnchorTemplate | Complex HTML anchor for the item. Uses Twig syntax to compile the anchor. Available parameters: item.field (with each field from the template), page (page title), index (index of the item on this page). |
| rawIdTemplate | More complex ID and aliases for the item. Uses Twig syntax to compile the ID/Aliases (one alias per line). Make sure that the first result is always the same as the first ID in renderedIdTemplate. Available parameters: item.field (with all fields from the template), page (page title), index (index of the item on this page). |
| renderedAnchorTemplate | Complex HTML anchor for the item. Uses Twig syntax to compile the anchor. Available parameters: item.field (with each field from the template), page (page title), index (index of the item on this page). |
| renderedIdTemplate | More complex ID and aliases for the item. Uses Twig syntax to compile the ID/Aliases (one alias per line). Make sure that the first result is always the same as the first ID in rawIdTemplate. Available parameters: item.field (with all parsed fields from the rendered page), page (page title), index (index of the item on this page). |
| wikidataIdTemplate | Additional aliases for the item. Uses Twig syntax to compile the alias (one alias per line). Available parameters: item.P1234 (with all properties specified in wikidataFields). |
| idToQuery | When searching for an ID, how to search on the Mediawiki site. idToQuery uses Twig syntax to generate the query, with multiple lines prefixed by a query option and =; available parameter: id (the id we are looking for). Query options: field (which field to query), value (which value to query), wikidataProperty and wikidataValue (value can't be found in the page source, needs to query wikidata first -> use wikidata item id as value), page (doesn't need to search, just load the specified page). |
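As a rough illustration of the idToQuery line format (a query option, then `=`, then its value, one pair per line), the sketch below turns such text into an object. It is not the module's actual parser: `parseQueryOptions` is a hypothetical helper and `ObjektID` is an invented template field name.

```javascript
// Hypothetical helper: turn rendered idToQuery output (one 'option=value'
// pair per line) into an object. Not the module's actual parser.
function parseQueryOptions (rendered) {
  const options = {}
  for (const line of rendered.split('\n')) {
    const pos = line.indexOf('=')
    if (pos === -1) continue // skip lines without an option prefix
    const key = line.slice(0, pos).trim()
    // keep everything after the first '=', so values may contain '='
    options[key] = line.slice(pos + 1).trim()
  }
  return options
}

// 'ObjektID' is an invented field name, 24536 stands for the requested id
console.log(parseQueryOptions('field=ObjektID\nvalue=24536'))
```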

Rendered Fields Parameter:

| Parameter | Description |
| ----- | ------------------------------------------ |
| column | Table column |
| type | 'html' (default), 'image' (parse url, width, height from first image in this field) |
| domQuery | CSS style query for a DOM node in the cell. |
| domAttribute | Use the value of the DOM node (or the cell, if domQuery was not specified). |
| regexp | A regular expression, where the first match is the resulting value (to exclude patterns, use: /foo(?:bar)(bla)/ -> "bla"). |
| modify | A TwigJS template, which can modify value. The following parameters are available: value (if column was specified; the result after column, domQuery, domAttribute, regexp), row (the full table row as array), index (the n'th item on this page), page (the name of the Wikipedia page). |
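The regexp hint above relies on non-capturing groups: `(?:bar)` has to match but is not captured, so the first capture group is `(bla)`. A minimal check in plain JavaScript, independent of this module:

```javascript
// (?:bar) must match but is not captured, so group 1 is (bla)
const match = 'foobarbla'.match(/foo(?:bar)(bla)/)
const value = match ? match[1] : null
console.log(value) // bla
```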