node-nest

v0.8.0

Robust scraping framework

Nest

Nest is a high-level, robust framework for web scraping.

Features

  • Dynamic Scraping with a headless browser (Puppeteer)
  • Static scraping with direct HTTP requests without JS evaluation (cheerio)
  • Parallel scraping, worker queue
  • MongoDB integration. State is persisted to Mongo after each operation
  • Minimal API and dead-easy to use

Requirements

  • MongoDB up and running
  • Node

Installation

Install MongoDB.

Also install node-nest in your project:

npm install node-nest

Usage

// Instantiates a new Nest object
var Nest = require('node-nest');
var nest = new Nest();

// Register routes
var someRoute = require('./routes/some-route');
var anotherRoute = require('./routes/another-route');
nest.addRoute(someRoute);
nest.addRoute(anotherRoute);

// Queues scraping operations
nest.queue('some-route', { priority: 90, query: { userId: 123 } });
nest.queue('another-route', { query: { someVar: 'something' } });

// Starts the engine
nest.start();

Example

In this guide, we'll scrape Hacker News articles. To use Nest, first initialize a Nest object:

var Nest = require('node-nest');
var nest = new Nest();

By default, Nest uses as many workers as you have CPU cores, and tries to connect to a MongoDB instance running at 127.0.0.1:27017. You can configure these parameters as follows:

var Nest = require('node-nest');

var nest = new Nest({
  workers: 4,          // Set the number of workers scraping in parallel to 4
  mongo: {
    db: 'nest',        // Use the 'nest' mongo database
    host: '127.0.0.1', // Connect to the Mongo process running at localhost
    port: '27017'      // Connect to the Mongo process running at port 27017
  }
});
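
The defaults described above can be pictured with a small sketch. This is not node-nest's actual code: `resolveOptions` is a hypothetical helper, and the default database name `'nest'` is an assumption based on the examples in this README.

```javascript
const os = require('os');

// Hypothetical helper, not part of node-nest: merges user options over the
// defaults described in the text above.
function resolveOptions(options = {}) {
  const defaults = {
    workers: os.cpus().length, // one worker per CPU core
    mongo: {
      db: 'nest',              // assumed default database name
      host: '127.0.0.1',
      port: '27017'
    }
  };

  return {
    ...defaults,
    ...options,
    // Merge the nested object separately, so that a partial 'mongo'
    // override keeps the remaining defaults.
    mongo: { ...defaults.mongo, ...(options.mongo || {}) }
  };
}

console.log(resolveOptions({ workers: 4, mongo: { db: 'scraper' } }));
```

Merging the nested `mongo` object on its own means you could override only the database name without having to repeat the host and port.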

Then you must define some routes. A route is a definition of a site's section, for example a profile page, a post page, or a search results page.

Route

A route defines the URL pattern that matches a particular site section, and describes how data should be extracted from that page and structured, by explicitly defining a scraping function.

Depending on the returned data from the scraping function, Nest will store the structured scraped data in the mongo database and/or queue more URLs to be scraped.

You can add more routes by using the method nest.addRoute(). Let's define how the "hackernews homepage" route should be scraped:

nest.addRoute({

  // This is the route ID
  key: 'hackernews-homepage',

  // This is the URL pattern corresponding to this route
  url: 'https://news.ycombinator.com',

  // This is the scraper function, defining how this route should be scraped
  scraper: function($) {

    // You should return an object with the following properties:
    // - items:       `Array` Items to save in the database.
    // - jobs:        `Array` New scraping jobs to add to the scraper worker queue
    // - hasNextPage: `Boolean` If true, Nest will scrape the "next page"
    var data = {
      items: []
    };

    // The HTML is already loaded and wrapped with Cheerio in '$',
    // meaning you can get data from the page, jQuery style:
    $('tr.athing').each((i, row) => {
      data.items.push({
        title: $(row).find('a.storylink').text(),
        href: $(row).find('a.storylink').attr('href'),
        postedBy: $(row).find('a.hnuser').text(),

        // this is the only required property in an item object
        key: $(row).attr('id')
      });
    });

    // In this example, Nest will only save the objects
    // stored in 'data.items', into the mongo database
    return data;
  }
});
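
To make the scraper's return contract more concrete, here is a rough sketch of what could happen to the returned object. It is an illustration only: `handleScraperResult`, `savedItems` and `queuedJobs` are hypothetical names, and plain arrays stand in for the Mongo collections.

```javascript
// Hypothetical stand-ins for the Mongo collections; plain arrays keep the
// sketch self-contained.
const savedItems = [];
const queuedJobs = [];

// Hypothetical helper, not node-nest's API: handles the three properties
// a scraper function may return.
function handleScraperResult(routeKey, result) {
  const { items = [], jobs = [], hasNextPage = false } = result;

  for (const item of items) {
    // 'key' is the only required property of an item
    if (item.key === undefined) {
      throw new Error('Each item needs a "key" property');
    }
    savedItems.push({ ...item, routeId: routeKey });
  }

  for (const job of jobs) {
    queuedJobs.push(job);
  }

  return { saved: items.length, queued: jobs.length, hasNextPage };
}

const summary = handleScraperResult('hackernews-homepage', {
  items: [{ key: '12160127', title: 'DeepHeart' }],
  jobs: [{ routeId: 'hackernews-post', query: { id: '12160127' } }]
});
console.log(summary);
```

Every property is optional in this sketch, which matches the route above returning only `items`.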

Then, you need to queue some scraping operations, and start the engine:

nest.queue('hackernews-homepage');
nest.start().then(() => console.log('Engine started!'));

To run this example, just run it with Node. Let's say you called this file "scrape-hackernews.js":

node scrape-hackernews

After running this example, your database will contain 30 scraped items from Hacker News, with the following structure:

{
  "_id" : ObjectId("5797199075c2d900da9e3a3e"),
  "key" : "12160127",
  "routeWeight" : 50,
  "routeId" : "hackernews-homepage",
  "href" : "https://github.com/jisaacso/DeepHeart",
  "title" : "DeepHeart: A Neural Network for Predicting Cardiac Health"
},
{
  "_id" : ObjectId("5797199075c2d900da9e3a3d"),
  "key" : "12160374",
  "routeWeight" : 50,
  "routeId" : "hackernews-homepage",
  "href" : "http://www.wsj.com/articles/apple-taps-bob-mansfield-to-oversee-car-project-1469458580",
  "title" : "Apple Taps Bob Mansfield to Oversee Car Project"
},
...etc

Try looking at the scraped data using mongo's native REPL:

mongo nest
> db.items.count()
> db.items.find().pretty()

  • You will see multiple "There are no pending jobs. Retrying in 1s" messages. This is fine. It means that the engine finished processing all the queued jobs, and the workers are just waiting for new jobs.

When you run this program again, the route "hackernews-homepage" will not be scraped again, because the state is persisted in Mongo, and Nest doesn't re-scrape individual URLs that have already been scraped.
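
The "scrape each URL only once" behavior can be sketched as follows. This is a simplified model, not Nest's implementation: a `Map` stands in for the persisted 'jobs' collection, and `scrapeOnce` is a hypothetical helper. The `state.finished` flag mirrors the field used in the cleanup queries later in this README.

```javascript
// In-memory stand-in for the 'jobs' collection, keyed by URL.
const jobsByUrl = new Map();

// Hypothetical helper: runs 'scrape' only if the URL has no finished job.
function scrapeOnce(url, scrape) {
  const existing = jobsByUrl.get(url);
  if (existing && existing.state.finished) {
    return { url, skipped: true };
  }
  const result = scrape(url);
  jobsByUrl.set(url, { url, state: { finished: true } });
  return { url, skipped: false, result };
}

let calls = 0;
const fakeScrape = () => {
  calls += 1;
  return 'ok';
};

const first = scrapeOnce('https://news.ycombinator.com', fakeScrape);
const second = scrapeOnce('https://news.ycombinator.com', fakeScrape);
console.log(first.skipped, second.skipped, calls); // false true 1
```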

You will notice this route is not that helpful, as it only gets superficial data from each item (the title and the href), and it only scrapes the first page of Hacker News.

Let's create a "hackernews post" route, and a new "hackernews articles" route. The new articles route should scrape the first 5 pages of Hacker News, and queue a scraping job to "hackernews post" for each scraped article in the articles list. After their post pages are scraped, the items in the database will be updated with the new information.

The full example looks as follows:

// in scrape-hackernews.js

var Nest = require('node-nest');

var nest = new Nest();

nest.addRoute({
  key: 'hackernews-post',

  // Route url strings are passed to lodash's 'template' function.
  // You can also provide a function that should return the newly built URL
  // @see https://lodash.com/docs#template
  url: 'https://news.ycombinator.com/item?id=<%= query.id %>',

  scraper: function($) {
    var $post = $('tr.athing').first();

    return {
      items: [{
        key: $post.attr('id'),
        title: $post.find('.title a').text(),
        href: $post.find('.title a').attr('href'),
        postedBy: $post.find('.hnuser').text(),

        // for the sake of this tutorial, let's just save the most voted comment
        bestComment: $('.comment').first().text()
      }]
    };
  }
});

nest.addRoute({
  key: 'hackernews-articles',

  // the scraping state is available in the URL generator function's scope
  // we can use the "currentPage" property to enable pagination
  url: 'https://news.ycombinator.com/news?p=<%= state.currentPage %>',

  scraper: function($) {
    var currentPage = $('.rank').last().text() / 30;

    var data = {
      items: [],

      // by returning data through the 'jobs' property,
      // you are queueing new scraping operations for the workers to pick up
      jobs: [],

      // if this property is true, the scraper will re-scrape the route,
      // but with the 'state.currentPage' parameter incremented by 1
      //
      // for the sake of this tutorial, let's just scrape the first 5 pages
      hasNextPage: currentPage < 5
    };

    // for each article
    $('tr.athing').each((i, row) => {

      // create superficial hackernews article items in the database
      data.items.push({
        key: $(row).attr('id'),
        title: $(row).find('a.storylink').text(),
        href: $(row).find('a.storylink').attr('href'),
        postedBy: $(row).find('a.hnuser').text()
      });

      // also, queue scraping jobs to the "hackernews-post" route, defined above
      data.jobs.push({
        routeId: 'hackernews-post', // defines which route to be used
        query: { // defines the "query" object, used to build the final URL
          id: $(row).attr('id')
        }
      });
    });

    // Nest will save the objects in 'data.items' and queue jobs in 'data.jobs'
    // Nest won't repeat URLs that have already been scraped
    return data;
  }
});

nest.queue('hackernews-articles');

nest.start();

After running the example, the first worker will go to the articles feed, scrape the 30 articles in the list, store those scraped items in the database, and queue scraping jobs to those articles by their article ID. Then, it will paginate and scrape the next page of the feed.
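
The pagination step can be modeled roughly like this. The sketch makes several assumptions: `buildUrl`, `fakeScraper` and `paginate` are hypothetical names, the first page number is assumed to be 1, and the fake scraper receives the state directly (a real Nest scraper receives the Cheerio-wrapped page `$`).

```javascript
// Hypothetical URL generator, standing in for the route's 'url' template.
const buildUrl = ({ state }) =>
  `https://news.ycombinator.com/news?p=${state.currentPage}`;

// Fake scraper: pretends every page exists and stops after page 5,
// mirroring 'hasNextPage: currentPage < 5' in the example above.
const fakeScraper = (state) => ({
  items: [],
  hasNextPage: state.currentPage < 5
});

// Hypothetical loop: re-run the route with 'currentPage' incremented by 1
// for as long as the scraper reports that there is another page.
function paginate(scraper, startPage = 1) {
  const visited = [];
  const state = { currentPage: startPage };
  let hasNextPage = true;

  while (hasNextPage) {
    visited.push(buildUrl({ state }));
    hasNextPage = Boolean(scraper(state).hasNextPage);
    if (hasNextPage) state.currentPage += 1;
  }
  return visited;
}

const urls = paginate(fakeScraper);
console.log(urls.length);           // 5
console.log(urls[urls.length - 1]); // https://news.ycombinator.com/news?p=5
```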

Meanwhile, the other workers will pick up the jobs in the queue, scrape the article pages, and update each article in the database by its article ID.
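
Updating an article by its ID can be thought of as an upsert keyed on the item's `key`. The sketch below is an assumption about the general idea, not Nest's storage code: `upsertItem` is a hypothetical helper, a `Map` stands in for the 'items' collection, and whether Nest merges fields exactly this way is not confirmed by this README.

```javascript
// In-memory stand-in for the 'items' collection, keyed by item key.
const itemsByKey = new Map();

// Hypothetical helper: inserts the item, or merges it into the stored one.
function upsertItem(item) {
  const merged = { ...(itemsByKey.get(item.key) || {}), ...item };
  itemsByKey.set(item.key, merged);
  return merged;
}

// First pass: superficial data from the articles list.
upsertItem({ key: '12160127', title: 'DeepHeart', postedBy: 'someuser' });

// Second pass: extra data from the post page, stored under the same key.
const updated = upsertItem({ key: '12160127', bestComment: 'Nice work!' });
console.log(updated);
```

Because both routes use the article ID as the item key, the second write enriches the existing record instead of creating a duplicate.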

Remember you can find the full example's code here.

Nest will avoid scraping URLs that have already been scraped

Remember, URLs that have already been scraped will not be scraped again. So, if you make changes to a finite route and want to test your new route, or if you want to repeat your routes, you can delete the finished scraped URLs from the 'jobs' collection by doing:

mongo nest

# This will delete all the finished URLs
> db.jobs.remove({ 'state.finished': true })

# This will only delete finished jobs for a particular route
> db.jobs.remove({ 'state.finished': true, 'routeId': 'my-route-key' })

# WARNING: This will delete every item and job in your database
> db.dropDatabase()

Set the NEST_DUMP_BROWSER_IO_TO_STD_OUT environment variable to 1 (read as process.env.NEST_DUMP_BROWSER_IO_TO_STD_OUT) to dump Puppeteer I/O to stdout.

Engine

By default, Nest creates one worker per CPU core. Each worker queries for an operation (operations are sorted by priority), runs that operation (which may spawn a bunch of other operations), and then queries for another operation.

Only 1 worker will be querying for an operation at a given time. That is to avoid having multiple workers working on the same op. If there are no unfinished operations, the worker will keep on querying for new ops every second or so.
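
The engine loop described above can be approximated with the following sketch. It is a simplified in-memory model under stated assumptions, not Nest's implementation: all function names are hypothetical, the queue is an array instead of a Mongo collection, a higher `priority` number is assumed to run first, and workers exit when the queue is empty instead of retrying every second.

```javascript
// Hypothetical: picks the pending operation with the highest priority
// and marks it as started so that no other worker can take it.
function takeNextOperation(queue) {
  const pending = queue
    .filter((op) => !op.state.started && !op.state.finished)
    .sort((a, b) => b.priority - a.priority);
  if (pending.length === 0) return null;
  pending[0].state.started = true;
  return pending[0];
}

// Promise-chain lock: each query waits for the previous one to finish,
// so only one worker is querying for an operation at any given time.
function createLock() {
  let tail = Promise.resolve();
  return (fn) => {
    const run = tail.then(fn);
    tail = run.catch(() => {});
    return run;
  };
}

async function worker(id, queue, withLock, log) {
  for (;;) {
    const op = await withLock(() => takeNextOperation(queue));
    if (!op) return; // the real engine would retry after about a second
    log.push(`worker ${id} ran ${op.routeId}`);
    op.state.finished = true;
  }
}

async function runEngine(queue, workerCount) {
  const withLock = createLock();
  const log = [];
  const workers = [];
  for (let id = 0; id < workerCount; id += 1) {
    workers.push(worker(id, queue, withLock, log));
  }
  await Promise.all(workers);
  return log;
}

const makeOp = (routeId, priority) => ({
  routeId,
  priority,
  state: { started: false, finished: false }
});

const demoQueue = [makeOp('low', 10), makeOp('high', 90), makeOp('mid', 50)];
runEngine(demoQueue, 2).then((log) => console.log(log.join('\n')));
```

Serializing only the "take an operation" step keeps the actual scraping parallel while still ensuring that two workers never pick the same operation.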

Tests

npm run test

Cheers.