contract-scraper

v6.0.0

A customisable data scraper for the web based on JSON contracts

With contract-scraper you can easily scrape an HTML page and return the data in a structured format.

Installation

npm install contract-scraper --save

or

yarn add contract-scraper

Usage

To scrape a page, create a new Scraper instance with the page URL, a contract and, optionally, Puppeteer launch options:

import Scraper from 'contract-scraper';

const contract = {
  itemSelector: 'li',
  puppeteer: true,
  attributes: {
    name: {
      type: 'text',
      selector: '.name'
    },
    link: {
      type: 'link',
      selector: 'a',
      attribute: 'href'
    }
  }
}

// Options passed to Puppeteer when it launches the browser
const puppeteerOptions = {
  headless: false,
}

const scraper = new Scraper('http://website.com', contract, puppeteerOptions)

The third argument lets you initialise the scraper with custom Puppeteer launch options.

A contract accepts the following properties:

itemSelector (string)

A CSS selector for the element to be scraped. The scraper will process all the elements matching this selector.

puppeteer (boolean)

If set to true, contract-scraper will use Puppeteer to load and scrape the page contents.

waitForPageLoadSelector (string)

Puppeteer will wait for this CSS selector to exist in the DOM before scraping the page. Must be used in conjunction with puppeteer: true.

attributes (object)

Defines the data to scrape for each item.

Each attribute matches an HTML element to scrape. The attribute type defines how data will be extracted from the element and how it will be formatted in the final output. For example, you can use one of the in-built types to extract a number from an element:

<ul>
  <li>
    <div class="name">Iron man</div>
    <div class="price">100 euros</div>
  </li>
  <li>
    <div class="name">Captain America</div>
    <div class="price">500 euros</div>
  </li>
</ul>

import Scraper from 'contract-scraper';

const contract = {
  itemSelector: 'li',
  attributes: {
    name: {
      type: 'text',
      selector: '.name',
    },
    price: {
      type: 'number',
      selector: '.price',
    },
  },
};

const scraper = new Scraper('http://characters.com', contract);

scraper.scrapePage().then(items => {
  console.log(items);
  // [
  //   {
  //     name: 'Iron man',
  //     price: 100
  //   },
  //   {
  //     name: 'Captain America',
  //     price: 500
  //   }
  // ]
});
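
To make the idea of a type concrete, here is a minimal sketch of what a number-like type could look like. It is not taken from contract-scraper's source: `NumberFromText` is a hypothetical name, and the only thing borrowed from this README is the convention (described in the custom types section) that a type receives the element's text in its constructor and exposes the result through a `value` property.

```javascript
// Hypothetical illustration only: not contract-scraper's actual number type.
class NumberFromText {
  constructor(inputText) {
    this.inputText = String(inputText);
  }

  // Returns the first numeric run found in the text, or null if there is none.
  get value() {
    const match = this.inputText.match(/-?\d+(?:\.\d+)?/);
    return match ? Number(match[0]) : null;
  }
}

console.log(new NumberFromText('100 euros').value); // 100
console.log(new NumberFromText('no digits').value); // null
```

Returning null for text without digits is a design choice of this sketch; the in-built type may behave differently.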

Each attribute can have the following properties:

  • name (string) - A label for this attribute for the final output

  • selector (string) - The CSS selector for the element (scoped to itemSelector).

  • type (string) - A custom type, or one of the in-built types, which return:

    • background-image: A background-image URL from a style string
    • link: An absolute URL
    • number: A number
    • size: A number for size in m²
    • text: Inner text of the element

  • attribute (optional) (string) - The name of the HTML attribute to scrape data from. For example, for the element:

    <a href="http://linktoscrape">Homepage</a>

    use:

    {
      name: 'URL',
      type: 'link',
      selector: 'a',
      attribute: 'href'
    }

    If attribute is not specified, the innerText of the element is used.

  • data (optional) (object) - HTML data attributes can be scraped in two ways:

    • Directly scraping a data attribute:

      <div data-country="Australia"></div>

      {
        name: 'Country',
        type: 'text',
        selector: 'data-country',
        data: { name: 'country' }
      }

      This will return "Australia" in your list of results.

    • Scraping a value from JSON stored inside a data attribute:

      <div data-price='{"currency": "aud"}'></div>

      {
        name: 'Price',
        type: 'text',
        selector: 'data-price',
        data: { name: 'price', key: 'currency' }
      }

      This will return "aud" in your list of results.
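
The two data modes above can be pictured with a small standalone sketch. `readDataValue` is a hypothetical helper, not part of contract-scraper's API, and it assumes the attribute holds strict JSON (double-quoted keys and strings) whenever a key is given.

```javascript
// Hypothetical helper: not part of contract-scraper's API.
// rawValue is the string stored in the data attribute; key is optional.
function readDataValue(rawValue, key) {
  if (key === undefined) {
    return rawValue; // plain data attribute, returned as-is
  }
  let parsed;
  try {
    parsed = JSON.parse(rawValue);
  } catch (err) {
    return undefined; // the attribute did not contain valid JSON
  }
  // Only objects can be indexed by key; anything else yields undefined.
  return parsed !== null && typeof parsed === 'object' ? parsed[key] : undefined;
}

console.log(readDataValue('Australia')); // Australia
console.log(readDataValue('{"currency": "aud"}', 'currency')); // aud
```

Returning undefined for malformed JSON keeps the sketch simple; a real scraper might prefer to throw or log instead.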

Nested attributes

It's also possible to scrape nested attributes, like a list inside an item:

<ul class="friends">
  <li>
    <span>Spiderman</span>
    <ul>
      <li><strong>Iron</strong><em>Man</em></li>
      <li><strong>Captain</strong><em>America</em></li>
    </ul>
  </li>
</ul>

The contract:

{
  "itemSelector": ".friends > li",
  "attributes": {
    "name": { "type": "text", "selector": "span" },
    "friends": {
      "itemSelector": "ul li",
      "attributes": {
        "firstName": { "type": "text", "selector": "strong" },
        "lastName": { "type": "text", "selector": "em" }
      }
    }
  }
}

This returns each item's friends as an array of objects; the nested attributes can use any type:

[
  {
    name: 'Spiderman',
    friends: [
      { firstName: 'Iron', lastName: 'Man' },
      { firstName: 'Captain', lastName: 'America' },
    ],
  },
];
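
One way to reason about the output shape of a nested contract is to walk it and list the fields it describes. The helper below is a hypothetical sketch (`describeFields` is not part of contract-scraper); it assumes only what this section shows, namely that an attribute with its own itemSelector is a nested contract whose results are returned as an array.

```javascript
// Hypothetical helper: lists the output fields a contract describes.
function describeFields(contract, prefix = '') {
  const fields = [];
  for (const [name, attribute] of Object.entries(contract.attributes)) {
    if (attribute.itemSelector !== undefined) {
      // Nested contract: its fields appear inside an array under this name.
      fields.push(...describeFields(attribute, `${prefix}${name}[].`));
    } else {
      fields.push(`${prefix}${name}: ${attribute.type}`);
    }
  }
  return fields;
}

const nestedContract = {
  itemSelector: '.friends > li',
  attributes: {
    name: { type: 'text', selector: 'span' },
    friends: {
      itemSelector: 'ul li',
      attributes: {
        firstName: { type: 'text', selector: 'strong' },
        lastName: { type: 'text', selector: 'em' },
      },
    },
  },
};

console.log(describeFields(nestedContract));
// [ 'name: text', 'friends[].firstName: text', 'friends[].lastName: text' ]
```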

Custom attribute types

In addition to the in-built attribute types, you can provide your own when you create a new instance of the scraper. A custom attribute type must be a class or constructor function whose instances expose a value property. Its constructor receives the innerText of the matching element as a string, and value holds that text formatted however you like.

For example if you wanted to extract a list of tags and format them as an array:

<ul>
  <li>
    <div class="name">Australia</div>
    <div class="tags">spiders,vegemite,scorching,heat</div>
  </li>
</ul>

import Scraper from 'contract-scraper';

const contract = {
  itemSelector: 'li',
  attributes: {
    countryName: {
      type: 'text',
      selector: '.name',
    },
    tags: {
      type: 'list',
      selector: '.tags',
    },
  },
};

// Constructor function: stores the formatted result on the value property
function ListFromString(commaSeparatedString) {
  this.value = commaSeparatedString.split(',');
}

const scraper = new Scraper('http://countries.com', contract, {
  list: ListFromString,
});

scraper.scrapePage().then(items => {
  console.log(items);
  // [
  //   {
  //     countryName: 'Australia',
  //     tags: [ 'spiders', 'vegemite', 'scorching', 'heat' ]
  //   }
  // ]
});
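
The same custom type could also be written as a class. The sketch below is an assumption-laden illustration rather than code from the library: `TagList` is a hypothetical name, and it relies only on the convention stated in this section (the constructor receives the element's text, and the result is read from `value`). It additionally trims whitespace and drops empty entries, which the README's example does not do.

```javascript
// Hypothetical custom type written as a class.
class TagList {
  constructor(inputText) {
    this.inputText = inputText;
  }

  // Splits on commas, trims each tag and discards empty ones.
  get value() {
    return this.inputText
      .split(',')
      .map(tag => tag.trim())
      .filter(tag => tag.length > 0);
  }
}

console.log(new TagList('spiders, vegemite,,heat ').value);
// [ 'spiders', 'vegemite', 'heat' ]
```

Such a class would presumably be registered the same way as the function above, for example `{ list: TagList }`.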

Parsing JSON inside script tags

Sometimes you may want to extract values from inside a script tag on the page. For the moment, contract-scraper only supports parsing JSON. For example:

<html>
  <head>
    <title>Page with a script tag</title>
  </head>
  <body>
    <script type="application/ld+json" id="info">
      {
        "characters": [
          {
            "name": "Jon Snow",
            "friends": [
              { "firstName": "Sansa", "lastName": "Stark" },
              { "firstName": "Bran", "lastName": "Stark" },
              { "firstName": "Arya", "lastName": "Stark" }
            ],
            "photo": "http://images.com/jonsnow",
            "price": {
              "amount": "12345 dollars"
            }
          },
          {
            "name": "Ned Stark",
            "friends": [
              { "firstName": "Sansa", "lastName": "Stark" },
              { "firstName": "Bobby", "lastName": "B" },
              { "firstName": "Little", "lastName": "finger" }
            ],
            "photo": "http://images.com/nedstark",
            "price": {
              "amount": "6789 euros"
            }
          }
        ]
      }
    </script>
  </body>
</html>

import Scraper from 'contract-scraper';

const contract = {
  scriptTagSelector: '#info',
  itemSelector: 'characters',
  attributes: {
    name: { type: 'text', selector: 'name' },
    friends: {
      itemSelector: 'friends',
      attributes: {
        firstName: { type: 'text', selector: 'firstName' },
        lastName: { type: 'text', selector: 'lastName' },
      },
    },
    photo: { type: 'link', selector: 'photo' },
    price: { type: 'number', selector: 'price.amount' },
  },
};

const scraper = new Scraper('http://characters.com', contract);

scraper.scrapePage().then(items => {
  console.log(items);
  // [
  //   {
  //     "name": "Jon Snow",
  //     "friends": [
  //       {
  //         "firstName": "Sansa",
  //         "lastName": "Stark"
  //       },
  //       {
  //         "firstName": "Bran",
  //         "lastName": "Stark"
  //       },
  //       {
  //         "firstName": "Arya",
  //         "lastName": "Stark"
  //       }
  //     ],
  //     "photo": "http://images.com/jonsnow",
  //     "price": 12345
  //   },
  //   {
  //     "name": "Ned Stark",
  //     "friends": [
  //       {
  //         "firstName": "Sansa",
  //         "lastName": "Stark"
  //       },
  //       {
  //         "firstName": "Bobby",
  //         "lastName": "B"
  //       },
  //       {
  //         "firstName": "Little",
  //         "lastName": "finger"
  //       }
  //     ],
  //     "photo": "http://images.com/nedstark",
  //     "price": 6789
  //   }
  // ]
});
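
In the contract above, selectors are property paths into the parsed JSON rather than CSS selectors, and `price.amount` reaches into a nested object. The following standalone sketch shows one way such paths could be resolved. `getByPath` and `pickRaw` are hypothetical helpers, not contract-scraper's implementation, and the sketch returns raw values without applying any attribute type.

```javascript
// Hypothetical helpers: a sketch of the idea, not contract-scraper's implementation.

// Resolve a dotted path such as 'price.amount' against a plain object.
function getByPath(source, path) {
  return path
    .split('.')
    .reduce((current, key) => (current == null ? undefined : current[key]), source);
}

// Walk a contract over parsed JSON, returning raw (unformatted) values.
function pickRaw(json, contract) {
  const items = getByPath(json, contract.itemSelector) || [];
  return items.map(item => {
    const result = {};
    for (const [name, attribute] of Object.entries(contract.attributes)) {
      result[name] = attribute.itemSelector !== undefined
        ? pickRaw(item, attribute) // nested contract
        : getByPath(item, attribute.selector);
    }
    return result;
  });
}

const parsedJson = {
  characters: [
    {
      name: 'Jon Snow',
      friends: [{ firstName: 'Sansa', lastName: 'Stark' }],
      price: { amount: '12345 dollars' },
    },
  ],
};

const jsonContract = {
  itemSelector: 'characters',
  attributes: {
    name: { type: 'text', selector: 'name' },
    friends: {
      itemSelector: 'friends',
      attributes: { firstName: { type: 'text', selector: 'firstName' } },
    },
    price: { type: 'number', selector: 'price.amount' },
  },
};

console.log(JSON.stringify(pickRaw(parsedJson, jsonContract)));
// [{"name":"Jon Snow","friends":[{"firstName":"Sansa"}],"price":"12345 dollars"}]
```

The price is still the raw string here; in the README's output it becomes 12345 because the number type formats it afterwards.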