domwaiter

v1.4.0

A well-behaved URL scraper that brings you delicious DOM objects

Downloads

16,099

Do you have a large collection of URLs you want to scrape? Scraping one page at a time is too slow. Scraping all the pages at once could put too much stress on the website you're scraping, and it could also crash your Node.js process due to excess memory usage. That's where this package comes in: it has a built-in rate limiter that lets you collect those pages quickly (and respectfully), and an event-emitting API that keeps memory usage low.

Features

  • Uses Promises so it's async/await friendly
  • Event-emitting API to keep a low memory footprint
  • Supports fetching JSON too (instead of HTML DOM)
  • Rate limiting powered by bottleneck
  • DOM parsing powered by cheerio (optional; can be disabled)
  • HTTP requests powered by got

Installation

npm install domwaiter

Usage

const domwaiter = require('domwaiter')

const pages = [
  { url: 'https://help.github.com/en', language: 'English' },
  { url: 'https://help.github.com/ja', language: 'Japanese' },
  { url: 'https://help.github.com/cn', language: 'Chinese' }
]

domwaiter(pages)
  .on('page', (page) => {
    console.log(page.language, page.$('title').text())
  })
  .on('error', (err) => {
    console.error(err)
  })
  .on('done', () => {
    console.log('Done!')
  })
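
If you prefer async/await over chained listeners, the events above can be wrapped in a Promise. The following is only a sketch: collectFromPages and its pick argument are hypothetical names, not part of domwaiter, and a stand-in EventEmitter replaces the real emitter so the example runs without network access. It assumes that fetching continues after an 'error' event and that 'done' fires once at the end.

```javascript
const { EventEmitter } = require('events')

// Hypothetical helper, not part of domwaiter. It accepts any emitter that
// fires 'page', 'error' and 'done' events as described in this README.
function collectFromPages (emitter, pick) {
  return new Promise((resolve) => {
    const results = []
    const errors = []
    emitter
      // Keep only the picked value so whole pages are not held in memory.
      .on('page', (page) => results.push(pick(page)))
      // Record failures and keep listening for the remaining pages.
      .on('error', (err) => errors.push(err))
      .on('done', () => resolve({ results, errors }))
  })
}

// Stand-in emitter for illustration only.
// With the real package you would pass domwaiter(pages) instead.
const fake = new EventEmitter()
const finished = collectFromPages(fake, (page) => page.language)

fake.emit('page', { url: 'https://help.github.com/en', language: 'English' })
fake.emit('error', new Error('request failed'))
fake.emit('page', { url: 'https://help.github.com/ja', language: 'Japanese' })
fake.emit('done')

finished.then(({ results, errors }) => {
  console.log(results, errors.length)
})
```

Picking a small value per page, rather than storing every page object, keeps the low memory footprint that the event API is meant to provide.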

API

This module exports a single function domwaiter:

domwaiter(pages, [opts])

  • pages Array (required) - Each item in the array must have a url property with a fully-qualified HTTP(S) URL. These objects can optionally have other properties, which will be included in the emitted page events. See below.
  • opts Object (optional)
    • parseDOM Boolean - Defaults to true. Set to false if you don't need the parsed page.$ DOM object. Disabling DOM parsing will boost performance.
    • json Boolean - Defaults to false. Set to true if you're fetching JSON instead of HTML. If true, a json property will be present on each emitted page object (and the $ and body properties will NOT be present).
    • maxConcurrent Number - How many jobs can be executing at the same time. Defaults to 5. This option is passed to the underlying bottleneck instance.
    • minTime Number - How long to wait after launching a job before launching another one. Defaults to 500 (milliseconds). This option is passed to the underlying bottleneck instance.
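
To get a feel for what minTime means for a large job, here is a rough estimate. It assumes only what the option description says: job launches are spaced at least minTime milliseconds apart. minLaunchWindowMs is a hypothetical helper, not part of domwaiter, and it ignores request duration and the maxConcurrent cap, so the real run time can only be longer.

```javascript
// Options as documented above; the values here are the stated defaults.
const opts = { parseDOM: true, json: false, maxConcurrent: 5, minTime: 500 }

// Hypothetical helper: lower bound (in ms) on the time needed to launch
// `pageCount` jobs when launches are spaced at least `minTime` ms apart.
function minLaunchWindowMs (pageCount, minTime) {
  if (pageCount <= 1) return 0
  // The first job launches immediately; each later one waits minTime.
  return (pageCount - 1) * minTime
}

const seconds = minLaunchWindowMs(1000, opts.minTime) / 1000
console.log(`1000 pages need at least ${seconds} seconds to launch`)

// With the real package the same object would be passed as:
// domwaiter(pages, opts)
```

With the defaults, 1000 pages cannot all be launched in under roughly eight minutes; lowering minTime shortens that at the cost of more load on the target site.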

Events

The domwaiter function returns an event emitter which emits the following events:

  • beforePageLoad - Emitted with page object for any optional prehandling you want to do, e.g. setting up a request timer.
  • page - Emitted after the page has been requested and the response is parsed. The emitted object is a shallow clone of the original page object you provided, with two added properties by default:
    • body: the raw HTTP response body text
    • $: The body parsed into a jQuery-like cheerio DOM object.
  • error - Emitted when an error occurs fetching a URL.
  • done - Emitted when all the pages have been fetched.
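
The request timer mentioned for beforePageLoad could look something like the sketch below. trackRequestTimes is a hypothetical helper, not part of domwaiter, and a stand-in EventEmitter is used so the example runs offline. It assumes each URL appears only once in the pages array. Because the page event carries a shallow clone, entries are matched by url rather than by object identity.

```javascript
const { EventEmitter } = require('events')

// Hypothetical helper, not part of domwaiter: records how long each URL
// took between its 'beforePageLoad' and 'page' events.
function trackRequestTimes (emitter) {
  const started = new Map() // url -> start timestamp (ms)
  const durations = new Map() // url -> elapsed time (ms)

  emitter.on('beforePageLoad', (page) => {
    started.set(page.url, Date.now())
  })

  emitter.on('page', (page) => {
    // The emitted page is a clone, so look it up by url, not identity.
    if (!started.has(page.url)) return
    durations.set(page.url, Date.now() - started.get(page.url))
    started.delete(page.url)
  })

  return durations
}

// Stand-in emitter for illustration only.
// With the real package you would pass the emitter returned by domwaiter(pages).
const fake = new EventEmitter()
const timings = trackRequestTimes(fake)
const page = { url: 'https://help.github.com/en', language: 'English' }

fake.emit('beforePageLoad', page)
fake.emit('page', { ...page, body: '<title>Help</title>' })

console.log(page.url, timings.get(page.url), 'ms')
```

Since the error event is only described as carrying the error, this sketch has no way to clear the start time of a failed URL; those entries simply stay in the started map.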

Tests

npm install
npm test

Dependencies

  • bottleneck: Distributed task scheduler and rate limiter
  • cheerio: Tiny, fast, and elegant implementation of core jQuery designed specifically for the server
  • got: Human-friendly and powerful HTTP request library for Node.js

Dev Dependencies

  • jest: Delightful JavaScript Testing.
  • nock: HTTP server mocking and expectations library for Node.js
  • standard: JavaScript Standard Style