bettong
v0.2.1
Published
Web crawler in JavaScript
Downloads
1
Maintainers
Readme
bettong
WIP: Bettong is a JavaScript Node.js web crawler based in Puppeteer. Based on the provided base URL, Bettong crawls pages on the same origin saves screenshots and HTML content.
Requirements
Bettong uses async/await which is only available in Node.js 8.x.x or higher.
Options
| argument | type | required | default | description |
|--------------------|--------------------|----------|----------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| baseUrl | string | true | null | URL to start crawling from. Bettong will only crawl pages that are on the same origin. |
| outputPath | string | false | "dist" | Relative path screenshots and/or html content will be saved to. Currently screenshots will be saved to provided-relative-path/screenshots and HTML content will be saved to provided-relative-path/html. |
| options | object | false | {} | Bettong options. |
| options.screenshot | boolean | false | true | Whether Bettong should save screenshot for each viewport. |
| options.html | boolean | false | true | Whether Bettong should save HTML content for each page. |
| options.viewports | puppeteer.Viewport | false | [ { width: 540, height: 480, }, { width: 720, height: 480 }, { width: 960, height: 480 }, { width: 1140, height: 480 } ] | Array of viewports used to take screenshots. Only used if options.screenshot
is set to true
. Please puppeteer docs for more information on available properties for Viewport interface. |
Usage
Node.js
Install bettong
npm install --save bettong
const Bettong = require('bettong');
const bettong = new Bettong('https://foo.bar');
await bettong.exec();
CLI
Usage: bettong exec [options] <base-url>
Execute crawling starting from the required base url <base-url>
Options:
-o, --output-path <path> relative output path (default: "dist")
-e, --exclude-pattern <pattern> RegExp page exclude pattern (default: "")
-s, --screenshot <screenshot> save screenshots (default: true)
-h, --html <screenshot> save html content (default: true)
-v, --viewport <viewport> viewport for screenshots, e.g. '{"width":128,"height":128}'
-h, --help output usage information
Install bettong globally
npm install -g bettong
Samples
Start crawling at https://foo.bar and exclude crawling pages that contain 'baz'
in the url.
bettong exec https://foo.bar -e '.*baz.*'
Output
Bettong will save screenshots to provided-relative-path/screenshots
and HTML content to provided-relative-path/html
. By default this would be dist/screenshots
and dist/html
. A screenshot will be saved as a full page screenshot in PNG format for each viewport provided in options.viewports
.