domwaiter
v1.4.0
A well-behaved URL scraper that brings you delicious DOM objects
Do you have a large collection of URLs you want to scrape? Scraping one page at a time is too slow, while scraping all the pages at once can put too much stress on the target website and can crash your Node.js process through excess memory usage. That's where this package comes in: its built-in rate limiter lets you collect those pages quickly (and respectfully), and its event-emitting API keeps memory usage low.
Features
- Uses Promises so it's async/await friendly
- Event-emitting API to keep a low memory footprint
- Supports fetching JSON too (instead of HTML DOM)
- Rate limiting powered by bottleneck
- DOM parsing powered by cheerio (optional; can be disabled)
- HTTP requests powered by got
Installation
npm install domwaiter
Usage
const domwaiter = require('domwaiter')

const pages = [
  { url: 'https://help.github.com/en', language: 'English' },
  { url: 'https://help.github.com/ja', language: 'Japanese' },
  { url: 'https://help.github.com/cn', language: 'Chinese' }
]

domwaiter(pages)
  .on('page', (page) => {
    console.log(page.language, page.$('title').text())
  })
  .on('error', (err) => {
    console.error(err)
  })
  .on('done', () => {
    console.log('Done!')
  })
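If you prefer async/await over event handlers, the emitter can be wrapped in a Promise. The following is a minimal sketch, not part of domwaiter itself: collectTitles is a hypothetical helper name, and it assumes that done is still emitted after one or more error events, which you should verify for your version.

```javascript
// Hypothetical helper (not provided by domwaiter): resolves once the
// emitter fires 'done', keeping only a small summary of each page.
function collectTitles (emitter) {
  return new Promise((resolve) => {
    const titles = []
    const errors = []
    emitter
      .on('page', (page) => {
        // Store just the url and title so the parsed DOM can be garbage collected.
        titles.push({ url: page.url, title: page.$('title').text() })
      })
      .on('error', (err) => {
        // Assumption: errors are reported per URL and fetching continues.
        errors.push(err)
      })
      .on('done', () => {
        resolve({ titles, errors })
      })
  })
}
```

Inside an async function this could be used as const { titles, errors } = await collectTitles(domwaiter(pages)). Keeping only the extracted fields, rather than whole page objects, preserves the low memory footprint the event API is meant to give you.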
API
This module exports a single function, domwaiter:
domwaiter(pages, [opts])
- pages Array (required) - Each item in the array must have a url property with a fully-qualified HTTP(S) URL. These objects can optionally have other properties, which will be included in the emitted page events. See below.
- opts Object (optional)
  - parseDOM Boolean - Defaults to true. Set to false if you don't need the parsed page.$ DOM object. Disabling DOM parsing will boost performance.
  - json Boolean - Defaults to false. Set to true if you're fetching JSON instead of HTML. If true, a json property will be present on each emitted page object (and the $ and body properties will NOT be present).
  - maxConcurrent Number - How many jobs can be executing at the same time. Defaults to 5. This option is passed to the underlying bottleneck instance.
  - minTime Number - How long to wait after launching a job before launching another one. Defaults to 500 (milliseconds). This option is passed to the underlying bottleneck instance.
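To get a feel for how minTime affects a large crawl, the sketch below estimates the minimum time needed just to launch every job. It is illustrative only: documentedDefaults restates the defaults listed above rather than the library's internals, and minLaunchWindowMs and jsonOpts are hypothetical names. The result is a lower bound, since maxConcurrent and slow responses can stretch the real duration further.

```javascript
// Defaults as documented above; restated here for illustration only.
const documentedDefaults = { parseDOM: true, json: false, maxConcurrent: 5, minTime: 500 }

// Hypothetical helper: minimum time (ms) needed to launch every job when
// consecutive launches are spaced at least minTime apart.
function minLaunchWindowMs (pageCount, opts = {}) {
  const { minTime } = { ...documentedDefaults, ...opts }
  return Math.max(0, pageCount - 1) * minTime
}

// Example options for a JSON API, throttled harder than the defaults.
const jsonOpts = { json: true, maxConcurrent: 2, minTime: 1000 }

console.log(minLaunchWindowMs(100)) // 49500
console.log(minLaunchWindowMs(100, jsonOpts)) // 99000
```

The same jsonOpts object could be passed as the second argument, as in domwaiter(pages, jsonOpts), in which case each emitted page object would carry a json property instead of $ and body.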
Events
The domwaiter function returns an event emitter which emits the following events:
- beforePageLoad - Emitted with the page object for any optional prehandling you want to do, e.g. setting up a request timer.
- page - Emitted after the page has been requested and the response is parsed. The emitted object is a shallow clone of the original page object you provided, with two added properties:
  - body: The raw HTTP response body text.
  - $: The body parsed into a jQuery-like cheerio DOM object.
- error - Emitted when an error occurs fetching a URL.
- done - Emitted when all the pages have been fetched.
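The request timer mentioned for beforePageLoad could be sketched as follows. attachTimers and onTiming are hypothetical names, not part of domwaiter, and the sketch assumes every URL in the pages array is unique. Because the page event carries a shallow clone, start times are keyed by url rather than by object identity.

```javascript
// Hypothetical helper (not provided by domwaiter): reports how long each
// URL took between 'beforePageLoad' and 'page'.
function attachTimers (emitter, onTiming) {
  const startedAt = new Map()

  emitter.on('beforePageLoad', (page) => {
    startedAt.set(page.url, Date.now())
  })

  emitter.on('page', (page) => {
    // The payload is a clone of the original object, so look up by url.
    const start = startedAt.get(page.url)
    if (start === undefined) return
    startedAt.delete(page.url)
    onTiming({ url: page.url, ms: Date.now() - start })
  })

  // Return the emitter so further .on(...) calls can be chained.
  return emitter
}
```

A call such as attachTimers(domwaiter(pages), (t) => console.log(t.url, t.ms)) would log one timing per fetched page. URLs that fail never reach the page event, so their start times stay in the map until the emitter is discarded.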
Tests
npm install
npm test
Dependencies
- bottleneck: Distributed task scheduler and rate limiter
- cheerio: Tiny, fast, and elegant implementation of core jQuery designed specifically for the server
- got: Human-friendly and powerful HTTP request library for Node.js