puppet-scraper

v0.2.1-canary.2

Scraping using Puppeteer the sane way 🤹🏻‍♂️

PuppetScraper is an opinionated wrapper library for scraping pages easily with Puppeteer, bootstrapped using Jared Palmer's tsdx.

Most people start a new scraping project by require-ing Puppeteer and writing their own logic to scrape pages, and that logic gets more complicated once multiple pages are involved.

PuppetScraper lets you pass just the URLs to scrape, the function to evaluate (the scraping logic), and how many pages (or tabs) to open at a time. In short, PuppetScraper abstracts away creating multiple page instances and retrying the evaluation logic.

Version 0.1.0 note: PuppetScraper was initially made as a project template rather than a wrapper library, but the core logic is still the same with various improvements and without extra libraries, so you can include PuppetScraper in your project easily using npm or yarn.

Brief example

Here's a basic example of scraping the entries on the first page of Hacker News:

// examples/hn.js

const { PuppetScraper } = require('puppet-scraper');

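// CommonJS has no top-level await, so the example runs inside an async function
async function main() {
  const ps = await PuppetScraper.launch();

  const data = await ps.scrapeFromUrl({
    url: 'https://news.ycombinator.com',
    // runs in the page context and collects each story's title and link
    evaluateFn: () => {
      let items = [];

      document.querySelectorAll('.storylink').forEach((node) => {
        items.push({
          title: node.innerText,
          url: node.href,
        });
      });

      return items;
    },
  });

  console.log({ data });

  await ps.close();
}

main().catch(console.error);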

View more examples in the examples directory.

Usage

Installing dependency

Install puppet-scraper via npm or yarn:

$ npm install puppet-scraper
      --- or ---
$ yarn add puppet-scraper

Install the peer dependency puppeteer, or a Puppeteer equivalent (chrome-aws-lambda, untested):

$ npm install puppeteer
      --- or ---
$ yarn add puppeteer

Instantiation

Create a PuppetScraper instance by launching a new browser instance, connecting to an existing browser instance, or using an existing browser instance:

const { PuppetScraper } = require('puppet-scraper');
const Puppeteer = require('puppeteer');

// launches a new browser instance
const instance = await PuppetScraper.launch();

// connect to an existing browser instance
const external = await PuppetScraper.connect({
  browserWSEndpoint: 'ws://127.0.0.1:9222/devtools/browser/...',
});

// use an existing browser instance
const browser = await Puppeteer.launch();
const existing = await PuppetScraper.use({ browser });

Customize options

launch and connect accept the same props as Puppeteer.launch and Puppeteer.connect, plus two extra properties, concurrentPages and maxEvaluationRetries:

const { PuppetScraper } = require('puppet-scraper');

const instance = await PuppetScraper.launch({
  concurrentPages: 3,
  maxEvaluationRetries: 10,
  headless: false,
});

concurrentPages sets how many pages/tabs are opened and used for scraping.
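
One way to picture what concurrentPages controls is a small worker pool: a fixed number of workers (standing in for tabs) keep claiming URLs until none are left. The sketch below is a minimal illustration under that assumption, not PuppetScraper's actual implementation; mapWithConcurrency and its parameters are hypothetical names.

```javascript
// Hypothetical helper, not part of PuppetScraper's API: run `task` over `items`
// with at most `limit` tasks in flight, keeping results in input order.
async function mapWithConcurrency(items, limit, task) {
  const results = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  // Each worker stands in for one page/tab: it claims items until none remain.
  async function worker() {
    while (next < items.length) {
      const index = next++;
      results[index] = await task(items[index], index);
    }
  }

  // Never start more workers than there are items, and always start at least one.
  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

With limit set to 3, no more than three tasks run at once, and the results come back in input order regardless of which task finishes first.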

maxEvaluationRetries sets how many times the page will try to evaluate the function passed as evaluateFn (see below). If the evaluation throws an error, the page reloads and evaluates the function again.
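
The reload-and-retry behavior can be sketched roughly as follows. This is an illustration, not PuppetScraper's source: evaluateWithRetries and createFlakyPage are hypothetical names, and treating the limit as a total number of attempts is an assumption. The only Puppeteer calls relied on are page.evaluate and page.reload.

```javascript
// Hypothetical sketch: `page` only needs Puppeteer's page.evaluate(fn) and page.reload().
// Assumption: `maxAttempts` counts every attempt, including the first one.
async function evaluateWithRetries(page, evaluateFn, maxAttempts) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await page.evaluate(evaluateFn);
    } catch (error) {
      lastError = error;
      // Reload before the next attempt, but not after the final failure.
      if (attempt < maxAttempts) await page.reload();
    }
  }
  throw lastError ?? new Error('maxAttempts must be at least 1');
}

// Stand-in page for illustration: evaluate() fails `failures` times, then succeeds.
function createFlakyPage(failures) {
  let calls = 0;
  let reloads = 0;
  return {
    async evaluate(fn) {
      calls += 1;
      if (calls <= failures) throw new Error('evaluation failed');
      return fn();
    },
    async reload() {
      reloads += 1;
    },
    get reloads() {
      return reloads;
    },
  };
}
```

Calling evaluateWithRetries(createFlakyPage(2), () => 'ok', 10) resolves after two reloads, while a page that keeps failing rejects with the last error once the attempts run out.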

If concurrentPages or maxEvaluationRetries is not specified, the default value is used:

export const DEFAULT_CONCURRENT_PAGES = 3;
export const DEFAULT_EVALUATION_RETRIES = 10;
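
As a rough sketch of how such defaults could be applied, destructuring with default values can separate the two PuppetScraper-specific properties from whatever is forwarded to Puppeteer. resolveOptions is a hypothetical helper written for illustration; the library's internals may differ.

```javascript
const DEFAULT_CONCURRENT_PAGES = 3;
const DEFAULT_EVALUATION_RETRIES = 10;

// Hypothetical helper: fill in missing values and split off the remaining
// options, which would be passed through to Puppeteer.launch or Puppeteer.connect.
function resolveOptions(options = {}) {
  const {
    concurrentPages = DEFAULT_CONCURRENT_PAGES,
    maxEvaluationRetries = DEFAULT_EVALUATION_RETRIES,
    ...puppeteerOptions
  } = options;
  return { concurrentPages, maxEvaluationRetries, puppeteerOptions };
}
```

Destructuring defaults only apply when a property is undefined, so an explicit value such as concurrentPages: 1 is kept as given.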

Scraping single page

As shown in the example above, use .scrapeFromUrl and pass an object with the following properties:

  • url: string, page URL to be opened
  • evaluateFn: function, function to evaluate (scraper method)
  • pageOptions: object, Puppeteer.DirectNavigationOptions props to override page behaviors

const data = await instance.scrapeFromUrl({
  url: 'https://news.ycombinator.com',
  evaluateFn: () => {
    let items = [];

    document.querySelectorAll('.storylink').forEach((node) => {
      items.push({
        title: node.innerText,
        url: node.href,
      });
    });

    return items;
  },
});

pageOptions defaults the waitUntil property to networkidle0, which you can read more about in the API documentation.
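
To override that default, pass your own pageOptions. The sketch below wraps the call in a hypothetical scrapeTitle helper; waitUntil and timeout are standard Puppeteer navigation options, and the values chosen here are only examples.

```javascript
// Hypothetical wrapper: `instance` is expected to be a PuppetScraper instance
// (or anything exposing a compatible scrapeFromUrl method).
async function scrapeTitle(instance, url) {
  return instance.scrapeFromUrl({
    url,
    // evaluated inside the page, so `document` refers to the scraped page
    evaluateFn: () => document.title,
    pageOptions: {
      waitUntil: 'domcontentloaded', // don't wait for the network to go idle
      timeout: 15000, // give up on navigation after 15 seconds
    },
  });
}
```

Waiting for domcontentloaded usually returns sooner than networkidle0, at the cost of possibly missing content that loads later.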

Scraping multiple pages

Use .scrapeFromUrls, which works like .scrapeFromUrl but takes a urls property containing an array of URL strings:

  • urls: string[], page URLs to be opened
  • evaluateFn: function, function to evaluate (scraper method)
  • pageOptions: object, Puppeteer.DirectNavigationOptions props to override page behaviors

const urls = Array.from({ length: 5 }).map(
  (_, i) => `https://news.ycombinator.com/news?p=${i + 1}`,
);

const data = await instance.scrapeFromUrls({
  urls,
  evaluateFn: () => {
    let items = [];

    document.querySelectorAll('.storylink').forEach((node) => {
      items.push({
        title: node.innerText,
        url: node.href,
      });
    });

    return items;
  },
});
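
The return shape of .scrapeFromUrls is not described here, so the following is only a sketch of one possible post-processing step. It assumes the result is an array holding either one array of items per URL or the items themselves; mergeScrapedItems is a hypothetical helper that flattens one level and drops entries with duplicate url values.

```javascript
// Hypothetical helper: merge per-page results into one list, deduplicated by `url`.
// Assumption: each item has the { title, url } shape produced by the evaluateFn above.
function mergeScrapedItems(results) {
  const byUrl = new Map();
  // flat() unwraps one level of nesting and leaves already-flat items untouched
  for (const item of results.flat()) {
    if (!byUrl.has(item.url)) byUrl.set(item.url, item);
  }
  return [...byUrl.values()];
}
```

For example, mergeScrapedItems(data) would turn five pages of results into a single list in which each story URL appears once.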

Closing instance

When there's nothing left to do, don't forget to close the instance, which also closes the browser:

await instance.close();

Access the browser instance

PuppetScraper also exposes the browser instance if you want to do things manually:

const browser = instance.___internal.browser;

Contributing

Thanks goes to these wonderful people (emoji key):

This project follows the all-contributors specification. Contributions of any kind welcome!

License

MIT License, Copyright (c) 2020 Griko Nibras