

Painless Crawler


Introduction

Working with some of the top Node.js crawlers on GitHub has led to a lot of frustration in getting something that simply works, and I ended up spending many hours playing around and figuring out how they were supposed to behave.

As such, I wanted to build a crawler that is, first and foremost, painless: one that just works and has clear documentation, allowing the user to focus on more important tasks.

In addition, I intend to use this project to expose myself to testing and continuous integration.

Installation

$ npm install --save painless-crawler

Features

  • Hyperlinks found on each page are resolved to absolute URLs and validated
  • Provides a server-side jQuery-style wrapper (via Cheerio) for each crawled page (see the sketch below)
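
For instance, here is a minimal sketch of what the jQuery-style wrapper lets you do inside a crawl callback. The URL and selectors are only illustrative; the callback signature is the one documented under Task Callback below.

var PainlessCrawler = require('painless-crawler');

var crawler = new PainlessCrawler(function (error, response, linksFound, $) {
    if (error) {
        console.error(error);
        return;
    }

    // '$' wraps the fetched document, so jQuery-style selectors
    // work on the server just as they would in the browser
    console.log('Page title: ' + $('title').text());
    $('h1, h2').each(function () {
        console.log('Heading: ' + $(this).text().trim());
    });
});

crawler.queue('http://techcrunch.com');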

Getting Started

Even though I encountered many difficulties with Node-Crawler, I really liked how simple it is to use, and I tried to follow its usage pattern as much as possible.

A crawler object is essentially a task queue of URLs that will be continuously acted on. The example below visits the homepage of TechCrunch, finds and resolves all links on the page, and adds all the URLs found back into the task queue.

var PainlessCrawler = require('painless-crawler')

// First create a new crawler object 
// Define a callback for all items
var crawler = new PainlessCrawler(function (error, response, linksFound, $) {
    if (error) {
        console.error(error);
        return;
    }

    console.log('Crawled url: ' + response.request.uri.href);
    console.log('Links Found: ');
    console.log(linksFound);

    // Do things with jquery '$'
    // ...
	
    // Add all links on page back into the queue
    linksFound.forEach(function (link) {
        crawler.queue(link);
    });
});

// Add a url to the task queue to kick start the crawler
var URL = 'http://techcrunch.com';
crawler.queue(URL);

Task Callback

The crawl callback has the form shown below and takes the following arguments:

var callback = function (error, response, linksFound, $) {
	if (error) {
		console.error(error);
		return;
	}
	
	// ...
}

  • error String: specifies the error (if any)
  • response Object: the response for the HTTP request made; an instance of http.IncomingMessage
  • linksFound Array: an array of hyperlinks found on the page; all links are validated, stripped of any fragment identifiers, and relative links are resolved to absolute URLs
  • $: a Cheerio object that represents the document and provides jQuery-style functions
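
As a rough illustration of what these arguments look like at runtime (the status code and URLs below are hypothetical; response.request.uri.href is the same field used in the Getting Started example above):

var PainlessCrawler = require('painless-crawler');

var crawler = new PainlessCrawler(function (error, response, linksFound, $) {
    if (error) {
        // 'error' describes what went wrong for this task
        console.error(error);
        return;
    }

    // 'response' is the http.IncomingMessage for the request
    console.log(response.statusCode);         // e.g. 200
    console.log(response.request.uri.href);   // e.g. 'http://example.com/blog'

    // 'linksFound' contains only absolute URLs with fragment identifiers
    // stripped, e.g. an <a href="/about#team"> found on the page above
    // would appear as 'http://example.com/about'
    console.log(linksFound);
});

crawler.queue('http://example.com/blog');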

Hierarchy

Callbacks can be defined at different levels, and the crawler will search up the hierarchy for a valid callback in the order stated below.

  1. Task Configuration
  2. Parameter in crawler.queue()
  3. Constructor

If no callback is defined, an error will be thrown.
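
For example, a sketch of how this lookup order plays out; the callback names here are purely illustrative:

var PainlessCrawler = require('painless-crawler');

// Hypothetical callbacks, named only to show which one ends up running
function constructorCallback(error, response, linksFound, $) { /* ... */ }
function queueCallback(error, response, linksFound, $) { /* ... */ }
function taskCallback(error, response, linksFound, $) { /* ... */ }

var crawler = new PainlessCrawler(constructorCallback);

// 3. Constructor: nothing more specific is provided, so constructorCallback runs
crawler.queue('http://www.google.com');

// 2. Parameter in crawler.queue(): queueCallback overrides the constructor callback
crawler.queue('http://www.google.com', queueCallback);

// 1. Task configuration: taskCallback takes precedence over both of the above
crawler.queue({
    url: 'http://www.google.com',
    callback: taskCallback
}, queueCallback);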

Usage

Constructor

new PainlessCrawler([[options, ] callback])

  • options Object
  • callback function

Example:


var options = {
    maxConnections: 25
}

var crawler = new PainlessCrawler(options, callback)

Queue

painlessCrawler.queue(task[, callback])

  • task String | Object | Array
  • callback Function

task String

crawler.queue('http://www.google.com', callback);

task Object

Task configurations can also be passed into the queue. A task configuration is an object that contains a URL and, optionally, a callback for the task.

  • task.url String
  • task.callback function (optional)

var task0 = {
	url: 'http://www.google.com',
	callback: function (error, response, linksFound, $) {
		if (error) {
			console.error(error);
			return;
		}
		
		// ...
	}
}
crawler.queue(task0);

Note that if a task configuration is provided with a valid callback, it will take precedence over the callback provided as the second parameter of queue.

// myCallback will be ignored
crawler.queue(task0, myCallback);

task Array

An array of task objects can be passed into the queue as well.

If a valid callback is provided in a task object, that callback will be executed; otherwise, the crawler will go up the callback hierarchy to find one.

var taskConfigs = [task0, task1, task2];
crawler.queue(taskConfigs);

Tests

Run tests:

$ npm test

Testing in Docker is also supported:

$ docker build -t painless-crawler-test ./test
$ docker run painless-crawler-test