crawlerr
v1.5.0
A simple and fully customizable web crawler/spider for Node.js with server-side DOM. Comes with elegant and hell-simple APIs.
crawlerr is a simple yet powerful web crawler for Node.js, based on Promises. It lets you crawl only the urls that match a wildcard pattern, uses a Bloom filter for caching, and gives you a browser-like feeling through a server-side DOM.
- Simple: our crawler is simple to use;
- Elegant: provides a verbose, Express-like API;
- MIT Licensed: free for personal and commercial use;
- Server-side DOM: we use JSDOM to make you feel like you are in your browser;
- Configurable pool size, retries, rate limit and more.
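To give an idea of what "Bloom filter for caching" means in practice, here is a minimal sketch of a Bloom filter that remembers visited urls. It is illustrative only and is not crawlerr's actual implementation: the `BloomFilter` class, its size, and its hash function are all assumptions made for this example.

```javascript
// Minimal Bloom filter sketch for remembering visited urls.
// Illustrative only: NOT crawlerr's actual implementation.
class BloomFilter {
  constructor(size = 1024, hashCount = 3) {
    this.size = size;
    this.hashCount = hashCount;
    // One byte per slot for simplicity (a real filter would pack bits).
    this.slots = new Uint8Array(size);
  }

  // Seeded FNV-1a-style string hash mapped into [0, size).
  hash(value, seed) {
    let h = (2166136261 ^ seed) >>> 0;
    for (let i = 0; i < value.length; i++) {
      h ^= value.charCodeAt(i);
      h = Math.imul(h, 16777619) >>> 0;
    }
    return h % this.size;
  }

  add(value) {
    for (let seed = 0; seed < this.hashCount; seed++) {
      this.slots[this.hash(value, seed)] = 1;
    }
  }

  // false => definitely never added; true => probably added.
  has(value) {
    for (let seed = 0; seed < this.hashCount; seed++) {
      if (this.slots[this.hash(value, seed)] === 0) return false;
    }
    return true;
  }
}

const visited = new BloomFilter();
visited.add("http://example.com/");
console.log(visited.has("http://example.com/")); // true
```

The trade-off is that a Bloom filter can report a url as visited when it was not (a false positive), but it never forgets a url that was added, and it uses a fixed amount of memory however many urls are crawled.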
Installation
$ npm install crawlerr
Usage
crawlerr(base [, options])
You can find several examples in the examples/ directory. Here are some of the most important ones:
Example 1: Requesting title from a page
const crawlerr = require("crawlerr");
const spider = crawlerr("http://google.com/");
spider.get("/")
  .then(({ req, res, uri }) => console.log(res.document.title))
  .catch(error => console.log(error));
Example 2: Scanning a website for specific links
const spider = crawlerr("http://blog.npmjs.org/");
spider.when("/post/[digit:id]/[all:slug]", ({ req, res, uri }) => {
  const post = req.param("id");
  const slug = req.param("slug").split("?")[0];
  console.log(`Found post with id: ${post} (${slug})`);
});
Example 3: Server side DOM
const spider = crawlerr("http://example.com/");
spider.get("/").then(({ req, res, uri }) => {
  const document = res.document;
  const elementA = document.getElementById("someElement");
  const elementB = document.querySelector(".anotherForm");
  console.log(elementA.innerHTML);
});
Example 4: Setting cookies
const url = "http://example.com/";
const spider = crawlerr(url);
spider.request.setCookie(spider.request.cookie("foobar=…"), url);
spider.request.setCookie(spider.request.cookie("session=…"), url);
spider.get("/profile").then(({ req, res, uri }) => {
  //… spider.request.getCookieString(url);
  //… spider.request.setCookies(url);
});
API
crawlerr(base [, options])
Creates a new Crawlerr instance for a specific website with custom options. All routes will be resolved to base.
| Option | Default | Description |
|:-------------|:--------|:-----------------------------------------------|
| concurrent | 10 | How many requests can run simultaneously |
| interval | 250 | How often a new request should be sent (in ms) |
| … | null | See request defaults for more information |
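As a rough illustration of how the defaults in the table above might combine with the options you pass in, here is a small sketch. `resolveOptions` is a hypothetical helper written for this example and is not part of crawlerr's API; only the option names and default values come from the table.

```javascript
// Defaults taken from the options table above.
const DEFAULTS = { concurrent: 10, interval: 250 };

// Hypothetical helper (not part of crawlerr): user options override defaults.
function resolveOptions(options = {}) {
  return { ...DEFAULTS, ...options };
}

const options = resolveOptions({ concurrent: 5 });
console.log(options); // { concurrent: 5, interval: 250 }

// If one new request is sent every `interval` ms, this is the
// approximate number of new requests per second.
console.log(1000 / options.interval); // 4
```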
public .get(url)
Requests url. Returns a Promise which resolves with { req, res, uri }, where:
- req is the Request object;
- res is the Response object;
- uri is the absolute url (resolved from base).
Example:
spider
  .get("/")
  .then(({ res, req, uri }) => …);
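To make "resolved from base" concrete, the sketch below resolves a few routes against a base url using the standard WHATWG URL class that ships with Node.js. How crawlerr resolves urls internally is not documented here, so treat this as an approximation; `resolve` is a hypothetical helper.

```javascript
// Sketch: resolving a route against a base url with the WHATWG URL API.
// crawlerr may resolve urls differently internally.
const base = "http://example.com/";

// Hypothetical helper, not part of crawlerr.
function resolve(route) {
  return new URL(route, base).href;
}

console.log(resolve("/"));        // http://example.com/
console.log(resolve("/profile")); // http://example.com/profile
```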
public .when(pattern)
Searches the entire website for urls which match the specified pattern. pattern can include named wildcards, which can then be retrieved in the callback via req.param.
Example:
spider
  .when("/users/[digit:userId]/repos/[digit:repoId]", ({ res, req, uri }) => …);
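One way to picture how named wildcards could be matched is to compile the pattern into a regular expression with named groups. The sketch below does that under loud assumptions: `compile` and `match` are hypothetical helpers, and the meaning given to the digit and all types (one or more digits, and one or more of any character) is a guess based on the examples, not crawlerr's actual matching code.

```javascript
// Assumed wildcard types; the real library may support more or differ.
const TYPES = new Map([
  ["digit", "\\d+"], // one or more digits
  ["all", ".+"],     // one or more of any character
]);

// Hypothetical helper: turn "/post/[digit:id]" into a RegExp with named groups.
function compile(pattern) {
  const source = pattern
    .split(/(\[[a-z]+:[A-Za-z_]\w*\])/)
    .map(part => {
      const m = /^\[([a-z]+):([A-Za-z_]\w*)\]$/.exec(part);
      // Literal text between wildcards: escape RegExp metacharacters.
      if (!m) return part.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
      const [, type, name] = m;
      if (!TYPES.has(type)) throw new Error(`Unknown wildcard type: ${type}`);
      return `(?<${name}>${TYPES.get(type)})`;
    })
    .join("");
  return new RegExp(`^${source}$`);
}

// Returns an object of captured wildcards, or null when the path does not match.
function match(pattern, path) {
  const result = compile(pattern).exec(path);
  return result ? { ...result.groups } : null;
}

console.log(match("/users/[digit:userId]/repos/[digit:repoId]", "/users/42/repos/7"));
// { userId: '42', repoId: '7' }
```

Under these assumptions an all wildcard also swallows a trailing query string, which would explain why the blog example strips it with split("?").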
public .on(event, callback)
Executes a callback for a given event. For more information about which events are emitted, refer to queue-promise.
Example:
spider.on("error", …);
spider.on("resolve", …);
public .start() / .stop()
Starts/stops the crawler.
Example:
spider.start();
spider.stop();
public .request
A configured request object which is used by retry-request when crawling webpages. Extends from request.jar(). Can be configured through options when initializing a new crawler instance. See crawler options and request documentation for more information.
Example:
const url = "https://example.com";
const spider = crawlerr(url);
const request = spider.request;
request.post(`${url}/login`, (err, res, body) => {
  request.setCookie(request.cookie("session=…"), url);
  // Next requests will include this cookie
  spider.get("/profile").then(…);
  spider.get("/settings").then(…);
});
Request
Extends the default Node.js incoming message.
public get(header)
Returns the value of an HTTP header. The Referrer header field is special-cased: Referrer and Referer are interchangeable.
Example:
req.get("Content-Type"); // => "text/plain"
req.get("content-type"); // => "text/plain"
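The lookup rules described above (case-insensitive names, with Referrer and Referer treated as aliases) can be sketched as a standalone function. `getHeader` is a hypothetical helper for illustration, not the library's own code; it relies on the fact that Node.js stores incoming header names in lower case.

```javascript
// Hypothetical helper mirroring the behaviour described above:
// case-insensitive lookup, with Referer/Referrer treated as aliases.
function getHeader(headers, name) {
  const key = name.toLowerCase();
  if (key === "referer" || key === "referrer") {
    return headers.referrer || headers.referer;
  }
  return headers[key];
}

// Node.js stores incoming header names in lower case.
const headers = {
  "content-type": "text/plain",
  referer: "http://example.com/",
};

console.log(getHeader(headers, "Content-Type")); // text/plain
console.log(getHeader(headers, "Referrer"));     // http://example.com/
```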
public is(...types)
Checks whether the incoming request contains the "Content-Type" header field and whether it matches one of the given MIME types. Based on type-is.
Example:
// Returns true with "Content-Type: text/html; charset=utf-8"
req.is("html");
req.is("text/html");
req.is("text/*");
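For intuition, here is a heavily simplified, standalone version of this kind of check. It is not the type-is implementation: the `is` function below is hypothetical, knows only a tiny table of shorthands (type-is resolves them through a full MIME database), and returns a plain boolean for simplicity.

```javascript
// Tiny shorthand table, an assumption for this sketch only.
const SHORTHANDS = new Map([
  ["html", "text/html"],
  ["json", "application/json"],
  ["text", "text/plain"],
]);

// Hypothetical, simplified matcher: does the Content-Type value
// match at least one of the given types?
function is(contentType, ...types) {
  if (!contentType) return false;
  // Drop parameters such as "; charset=utf-8".
  const actual = contentType.split(";")[0].trim().toLowerCase();
  const [actualType, actualSubtype] = actual.split("/");
  return types.some(candidate => {
    const expected = (SHORTHANDS.get(candidate) || candidate).toLowerCase();
    const [type, subtype] = expected.split("/");
    return (type === "*" || type === actualType) &&
           (subtype === "*" || subtype === actualSubtype);
  });
}

const header = "text/html; charset=utf-8";
console.log(is(header, "html"));      // true
console.log(is(header, "text/html")); // true
console.log(is(header, "text/*"));    // true
console.log(is(header, "json"));      // false
```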
public param(name [, default])
Returns the value of param name when present, or defaultValue otherwise:
- checks route placeholders, ex: user/[all:username];
- checks body params, ex: id=12, {"id":12};
- checks query string params, ex: ?id=12;
Example:
// .when("/users/[all:username]/[digit:someID]")
req.param("username"); // /users/foobar/123456 => foobar
req.param("someID"); // /users/foobar/123456 => 123456
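The lookup order listed above (route placeholders, then body, then query string, then the default) can be sketched like this. The `param` function and the `params`, `body` and `query` property names are assumptions borrowed from Express-style conventions, not crawlerr's documented internals.

```javascript
// Hypothetical lookup order, following the list above:
// route placeholders, then body, then query string, then the default.
function param(req, name, defaultValue) {
  const sources = [req.params, req.body, req.query];
  for (const source of sources) {
    if (source && source[name] !== undefined) return source[name];
  }
  return defaultValue;
}

// Assumed request shape, for illustration only.
const req = {
  params: { username: "foobar" },
  body: { id: 12 },
  query: { id: "99", page: "2" },
};

console.log(param(req, "username"));       // foobar
console.log(param(req, "id"));             // 12 (body is checked before query)
console.log(param(req, "page"));           // 2
console.log(param(req, "missing", "n/a")); // n/a
```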
Response
public jsdom
Returns the JSDOM object.
public window
Returns the DOM window for response content. Based on JSDOM.
public document
Returns the DOM document for response content. Based on JSDOM.
Example:
res.document.getElementById(…);
res.document.getElementsByTagName(…);
// …
Tests
$ npm test