node-crawling-framework

v0.0.1-alpha.2

NodeJs crawling & scraping framework heavily inspired by Scrapy (Python)


Current stage: alpha (work in progress)

"node-crawling-framework" is a crawling & scraping framework for NodeJs heavily inspired by Scrapy.

A Node job server is also in the works (roughly a scrapyd equivalent, based on BullJs).

Features (not fully tested and finalized)

The core is working: Crawler, Scraper, Spider, item processors (pipeline), DownloadManager, downloader.

  • Modular and easily extendable architecture through middlewares and class inheritance:

    • add your own middlewares for spiders, item-processors, and downloaders.
    • extend framework spiders and get some features for free.
  • DownloadManager: delay and concurrency limit settings,

  • RequestDownloader: downloader based on request package,

  • Downloader middlewares:

    • cookie: handle cookie storage between requests,
    • defaultHeaders: add default headers to each request,
    • retry: retry requests on error,
    • stats: collect some stats during the crawling (requests & errors count, ...)
  • Spiders:

    • BaseSpider: every spider must inherit from this one,
    • Sitemap: parse a sitemap and feed the spider with the URLs found,
    • Elasticsearch: feed the spider with URLs from Elasticsearch
  • Spider middlewares:

    • cheerio: cheerio helper on response to get a cheerio object,
    • scrapeUtils: cheerio + some helpers to facilitate the scraping (methods: scrape, scrapeUrl, scrapeRequest, ...),
    • filterDomains: filter out requests to unauthorized domains
  • Item processor middlewares:

    • printConsole: log items to the console,
    • jsonLineFileExporter: write scraped items to a JSON file, one JSON document per line (easier to parse afterwards, smaller memory footprint),
    • logger: log items to the logger,
    • elasticsearchExporter: export items to elasticsearch
  • Logger: configurable logger (default: console)
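
Custom logger sketch

The configuration example states that the logger must implement the console interface. As a minimal sketch of what such an object could look like, the helper below wraps console and prefixes every message. The name createPrefixedLogger is hypothetical, and the set of methods the framework actually calls (log, info, warn, error, debug here) is an assumption, not something the README confirms.

```javascript
// Hypothetical helper, not part of node-crawling-framework.
// Builds an object exposing the usual console methods, each one adding a prefix.
// Assumption: the framework only calls log/info/warn/error/debug on its logger.
function createPrefixedLogger(prefix, target = console) {
  const logger = {};
  for (const level of ['log', 'info', 'warn', 'error', 'debug']) {
    // Forward every call to the same method on the wrapped target.
    logger[level] = (...args) => target[level](`[${prefix}]`, ...args);
  }
  return logger;
}
```

Under those assumptions, it could be plugged into the crawler configuration as logger: createPrefixedLogger('quotes').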

Project example

See Quotesbot

Spider example

const { BaseSpider } = require('node-crawling-framework');

class CssSpider extends BaseSpider {
  constructor() {
    super();
    this.startUrls = ['http://quotes.toscrape.com'];
  }

  *parse(response) {
    const quotes = response.scrape('div.quote');
    for (let quote of quotes) {
      yield {
        text: quote.scrape('span.text').text(),
        author: quote.scrape('small.author').text(),
        tags: quote.scrape('div.tags > a.tag').text()
      };
    }
    yield response.scrapeRequest({ selector: '.next > a' });
  }
}

module.exports = CssSpider;

Crawler configuration example

module.exports = {
  settings: {
    maxDownloadConcurency: 1, // maximum download concurrency, default: 1
    filterDuplicateRequests: true, // filter already scraped requests, default: true
    delay: 100, // delay in ms between requests, default: 0
    maxConcurrentScraping: 500, // maximum concurrent scraping, default: 500
    maxConcurrentItemsProcessingPerResponse: 100, // maximum concurrent item processing per response, default: 100
    autoCloseOnIdle: true // auto close crawler when crawling is finished, default:true
  },
  logger: null, // logger, must implement console interface, default: console
  spider: {
    type: '', // spider to use for crawling, search spider in ${cwd} or ${cwd}/spiders, can also be a class definition object
    options: {}, // spider constructor args
    middlewares: {
      scrapeUtils: {}, // add utils methods to the response, ex: "response.scrape()"
      filterDomains: {} // avoid unwanted domain requests from being scheduled
    }
  },
  itemProcessor: {
    middlewares: {
      jsonLineFileExporter: {}, // write scraped items to a JSON file, one JSON document per line (easier to parse afterwards, smaller memory footprint)
      logger: {} // log scraped items through the crawler logger
    }
  },
  downloader: {
    type: 'RequestDownloader', // downloader to use, can also be a class definition object
    options: {}, // downloader constructor args
    middlewares: {
      stats: {}, // give some stats about requests, ex: number of requests/errors
      retry: {}, // retry on failed requests
      cookie: {} // store cookie between requests
    }
  }
};
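
Concurrency and delay sketch

To give an intuition for how a concurrency limit and a delay between requests can interact, here is a generic sketch. It is not the framework's DownloadManager, and the exact semantics of the settings above may differ. The function name runThrottled and its options are hypothetical; it assumes maxConcurrency is at least 1.

```javascript
// Generic illustration, NOT the framework's implementation.
// Runs async tasks with at most `maxConcurrency` in flight and spaces task
// starts roughly `delay` ms apart (timers are not perfectly precise).
async function runThrottled(tasks, { maxConcurrency = 1, delay = 0 } = {}) {
  const results = new Array(tasks.length);
  let next = 0;        // index of the next task to pick up
  let nextStartAt = 0; // earliest timestamp at which the next task may start

  const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

  async function worker() {
    while (next < tasks.length) {
      const index = next++;
      // Reserve a start slot so that consecutive starts are spaced by `delay`.
      const now = Date.now();
      const startAt = Math.max(now, nextStartAt);
      nextStartAt = startAt + delay;
      if (startAt > now) await sleep(startAt - now);
      results[index] = await tasks[index]();
    }
  }

  // One worker per allowed concurrent task; each pulls from the shared list.
  const workerCount = Math.min(maxConcurrency, tasks.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

With maxConcurrency: 1 and delay: 100, tasks would run one at a time with their starts about 100 ms apart, which is one plausible reading of the settings shown in the configuration.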

Crawler instantiation example

const { createCrawler } = require('node-crawling-framework');

const config = require('./config');
const crawler = createCrawler(config, 'CssSpider');

crawler.crawl().then(() => {
  console.log('✨  Crawling done');
});
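
Reading exported items back

Once a crawl has finished, a file written with one JSON document per line (as jsonLineFileExporter is described to do) can be consumed line by line without loading it fully into memory. The sketch below uses only Node core modules; readJsonLines is a hypothetical helper, and the output file name and location depend on the exporter's options, which are not documented here.

```javascript
const fs = require('fs');
const readline = require('readline');

// Hypothetical helper, not part of node-crawling-framework.
// Streams a JSON Lines file and yields one parsed object per non-empty line.
async function* readJsonLines(filePath) {
  const lines = readline.createInterface({
    input: fs.createReadStream(filePath, { encoding: 'utf8' }),
    crlfDelay: Infinity // treat \r\n as a single line break
  });
  for await (const line of lines) {
    if (line.trim() === '') continue; // skip blank lines
    yield JSON.parse(line);
  }
}

// Usage (the path is an assumption):
//   for await (const item of readJsonLines('./items.jsonl')) {
//     console.log(item.author);
//   }
```

Because each line is parsed on its own, a malformed line throws at that line only, so a caller could wrap JSON.parse in a try/catch if partial files are expected.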

TODO list

  • Add unit tests
  • Add documentation
  • Add MongoDb feeder/exporter
  • Make some benchmarks ?
  • Finish formRequest scraping (add clickable elements)
  • Utils: add date parsing (moment wrapper), datapager helper ?
  • adding multi spider support ?
  • add crawling queue to settings / possibility to override the queue (could allow shared redis queue for distributed crawling)
  • allow to override/set/configure DownloadManager: could allow proxy pool handling for example
  • Puppeteer downloader:
    • be compatible with header and cookie middlewares
  • Split plugins/middlewares in packages
  • Command line tool, "ncf-cli"
    • scaffolding: create project (with wizard), spider, any middleware
    • crawl: launch crawl
    • deploy: deploy to node-job-server
  • find a solution for DNS caching
  • middleware to respect "robots.txt"
  • limit max response size
  • auto throttle