
spider-engine v0.1.3

Spider

Web crawling and scraping engine powered by Node.js.

How to use

To create a new spider, you can do:

var Spider = require('spider-engine');

// Pass either a scraper function or an options object (see the API section below)
var spider = new Spider(scraperOrOptions);

spider.query(queryParams);

The spider is an EventEmitter, so you can receive scraped results as they come in:

spider.on('data', function (data) {
	var results = data.items;
	// ...do something with the results
});

spider.on('finish', function (data) {
	console.log('Spider finished with code ' + data.code + '. ' + data.message);
});

API

Spider(scraper:Function)

Creates a new basic spider with the provided scraper.

// example:

var Spider = require('spider-engine');

var spider = new Spider(function ($) {
	// Get all the links in the page
	var links = [];
	$('a').each(function (i, elem) {
		links.push($(elem).attr('href'));
	});

	return {
		items: links,
	};
});

spider.query('http://en.wikipedia.org/wiki/Web_scraping');

spider.on('data', function (results) {
	console.log(results); // -> the object returned by the scraper
});

Spider(options:Object)

Creates a new spider engine with the provided parameters.

The following options are supported (a full example follows the list):

  • urlTemplate ( Function(queryParams) ) The function used to build the query URL. If no function is provided, the query will be used as is, and the spider won't automatically jump to the next page. You can use underscore's _.template function to generate your query templates.
  • scraper ( Function($) ) The function that will be used to process the response's HTML. This function must return an object containing:
    • items (Array) The items to be scraped. You can build your items freely; this is what the spider emits when scraping the site.
    • more (Boolean) If the more flag is set, the spider will request the next target. The next target is the same target with the start parameter increased by windowSize.
  • proxy (String) (optional) The proxy address to use. If no proxy is provided, the local IP is used instead.
  • defaults (Object) (optional) Default values to use when building the URL.
  • headers (String) (optional) The headers to be sent as part of the spider's requests.
  • maxRetries (Number) (default: 100) If our IP is blocked, retry the request up to this many times.
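
For instance, here is a sketch of a spider built from options. The search URL and CSS selector are hypothetical placeholders; the option names are the ones documented above:

var _ = require('underscore');
var Spider = require('spider-engine');

var spider = new Spider({
	// Hypothetical search endpoint; query and start are filled in per request
	urlTemplate: _.template('http://example.com/search?q=<%= query %>&start=<%= start %>'),
	scraper: function ($) {
		// Collect result titles (hypothetical selector)
		var items = [];
		$('.result-title').each(function (i, elem) {
			items.push($(elem).text());
		});
		return {
			items: items,
			// Keep paging while results keep coming
			more: items.length > 0,
		};
	},
	defaults: { start: 0 },
	maxRetries: 10,
});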

Spider::query(queryParams:Object)

spider.query('Scraping in nodejs');

The queryParams can be a string, or an object with the following properties (an object example follows the list):

  • query (String) The query string to use. If a urlTemplate is provided, this string will be available when constructing the URL, under the variable name query. If no urlTemplate is provided, this string will be used as is. In other words, if you do not provide a urlTemplate, make sure to put the whole URL here.
  • windowSize (Number) (default: 100) The window size to use.
  • start (Number) (default: 0) The starting value. If a urlTemplate is provided and the scraper function returns the more flag, this number will be increased by windowSize, and the spider will move to the next target (which is a URL built with the provided urlTemplate function).
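
For example, an object query aimed at a templated spider like the one sketched above (the query text is arbitrary):

spider.query({
	query: 'web scraping',
	windowSize: 50,
	start: 0,
});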

Spider::kill()

Stops the spider. This will also trigger the finish event.

spider.kill();
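
As a sketch, assuming you only need the first batch of results, you could kill the spider as soon as data arrives:

spider.on('data', function (data) {
	if (data.items.length > 0) {
		// We have what we need; this also triggers the finish event
		spider.kill();
	}
});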

Events

Spider inherits from EventEmitter, so the following events can be emitted from a spider:

  • start - The spider has started.
  • move - The spider started an HTTP request.
  • data - The spider scraped and returned results.
  • ipBlocked - Our IP got rejected by the server (Useful for logging, or to handle IP changes. Just saying.)
  • finish - The spider has finished.
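
As a sketch, you might wire up logging for these events like this (the handler bodies are illustrative; only data and finish are documented to carry a payload):

spider.on('start', function () {
	console.log('Spider started');
});

spider.on('move', function () {
	console.log('Requesting the next target...');
});

spider.on('ipBlocked', function () {
	console.log('IP rejected by the server; the spider retries up to maxRetries times');
});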

Tests

make test

Cheers.