
snoospider

v1.3.3

A Node.js spider for scraping reddit.

Features

(See documentation for comprehensive features and examples.)

snoospider lets you scrape submissions, comments, and replies from a given subreddit within a specified Unix time frame (in seconds), without getting bogged down by the learning curve of reddit's API, the snoowrap wrapper for reddit's API, or the Bluebird promises used by the wrapper.

If the directory option is supplied to an instance of snoospider, the spider will write JSON files with all fields and metadata to that relative directory, most importantly the body field of comments. This makes it easy to analyze the files with another tool, e.g., R and its RJSONIO package. A callback option can also be supplied to pass the scraped JSON arrays into a function, such as console.log(...) or any other, for direct processing in JavaScript without the need for file I/O.
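
For example, a callback along these lines could collect just the comment bodies for direct processing (a minimal sketch; the field names are assumptions based on the File Output example below):

// Hypothetical callback: receives the scraped JSON array and collects
// comment bodies. Field names are assumed from the File Output example.
function collectBodies(data) {
  const bodies = [];

  // data[0] is snoospider metadata; submissions follow it.
  data.slice(1).forEach(submission => {
    (submission.comments || []).forEach(comment => {
      bodies.push(comment.body);
    });
  });

  console.log(bodies);
}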

NOTE: If your use case falls outside of snoospider's scope, then you should move on to snoowrap—it is much more powerful than snoospider, but its learning curve is far greater for complex tasks.

Installation

First, to install snoospider as a dependency for your project, run:

npm install snoospider --save

Second, set up OAuth by running reddit-oauth-helper and following the directions:

npm install -g reddit-oauth-helper
reddit-oauth-helper

  1. Select a permanent token duration.

  2. Select the read and mysubreddits scopes. Through reddit, you must subscribe to the subreddits you want to scrape with the account you provide to reddit-oauth-helper.

Third, you should have received some JSON output from the helper. You will place some of its contents in another file:

  1. Create a file called credentials.json.

  2. In credentials.json, fill in your information from reddit and reddit-oauth-helper:

{
  "client_id": "",
  "client_secret": "",
  "refresh_token": "",
  "author": "/u/YourRedditUsername"
}

Usage

You may create a JavaScript file like this:

'use strict';

// Start of the crawl window: Feb 1, 2016, 21:00 UTC, in Unix seconds.
let currentCrawlTime = Date.UTC(2016, 1, 1, 21) / 1000;

const SNOOSPIDER = require('snoospider'),
      CREDENTIALS = require('path/to/credentials.json'),
      OPTIONS = {
        subreddit: 'funny',                      // subreddit to crawl
        startUnixTime: currentCrawlTime,         // window start, in seconds
        endUnixTime: currentCrawlTime + 60 * 60, // window end: one hour later
        numSubmissions: 3,                       // number of submissions to fetch
        directory: './',                         // write JSON files here
        callback: console.log,                   // also log results directly
        sort: 'top',                             // submission sort order
        comments: {                              // also crawl comments
          depth: 1,                              // reply depth to descend
          limit: 1                               // comments per submission
        }
      };

let spider = new SNOOSPIDER(CREDENTIALS, OPTIONS);

spider.crawl();

This file, let's say it's called test.js, can be run with the following command:

node --harmony test.js

Based on the provided options, spider.crawl() will write one output file and log all results to the console.

A few notes on the example file:

  • The --harmony flag must be used for the ES6 syntax that snoospider uses.
  • If options.comments is not specified, only submissions are crawled.
  • directory, callback, or both must be specified (see the sketch after this list).
  • callback is simply a function that executes after the spider is done crawling. If you want it to receive the scraped data, declare a parameter for it to do so.
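
As a sketch of the last two notes, any of the following option shapes is valid (other required options, such as subreddit and the time frame, are omitted here for brevity):

// Files only: write JSON to the given relative directory.
const FILE_ONLY = { directory: './output/' };

// Callback only: process the scraped data in memory, no file I/O.
const CALLBACK_ONLY = { callback: data => console.log(data.length) };

// Both: write files and process the data in the same run.
const BOTH = { directory: './output/', callback: console.log };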

Advanced Usage

The following code outputs files of submissions and corresponding comments for each day of February 2016, from 21:00 to 22:00 UTC (1 p.m. to 2 p.m. PST).

'use strict';

// Start of the first crawl window: Feb 1, 2016, 21:00 UTC, in Unix seconds.
let currentCrawlTime = Date.UTC(2016, 1, 1, 21) / 1000;

const DAY_IN_SECONDS = 24 * 60 * 60,
      HOUR_IN_SECONDS = 60 * 60,
      END_FEB = Date.UTC(2016, 1, 29, 23, 59, 59) / 1000,
      SNOOSPIDER = require('path/to/snoospider/src/snoospider.js'),
      CREDENTIALS = require('path/to/credentials.json'),
      OPTIONS = {
        subreddit: 'sports',
        startUnixTime: currentCrawlTime,
        endUnixTime: currentCrawlTime + HOUR_IN_SECONDS,
        numSubmissions: 8,
        directory: './output/',
        callback: step, // advance the window when each crawl finishes
        sort: 'comments',
        comments: {
          depth: 1,
          limit: 2
        }
      };

let spider = new SNOOSPIDER(CREDENTIALS, OPTIONS);

// Shift the one-hour window forward by one day and crawl again,
// stopping once the end of February is reached.
function step() {
  currentCrawlTime += DAY_IN_SECONDS;

  spider.setStartUnixTime(currentCrawlTime);
  spider.setEndUnixTime(currentCrawlTime + HOUR_IN_SECONDS);

  if (currentCrawlTime < END_FEB) spider.crawl();
}

spider.crawl();

Note how step is passed as the callback to the spider instance, so that each completed crawl kicks off the next one in sequence.

File Output

Output files should look something like this, with filenames of the form {subreddit}-{Unix time in milliseconds}:

[
  {
    "program": "snoospider",
    "version": "0.13.0",
    "blame": "/u/YourRedditUsername",
    "parameters": {
      "subreddit": "funny",
      "startUnixTime": 1436369760,
      "endUnixTime": 1439048160,
      "numSubmissions": 3,
      "directory": './',
      "comments": {
        "depth": 1,
        "limit": 1
      }
    }
  },
  {
    "...SUBMISSION METADATA...": {
      "comments": [
        {
          "...": "...",
          "replies": [
            {
              "...": "..."
            }
          ]
        }
      ]
    }
  },
  {
    "...AND SO ON": "AND SO FORTH..."
  }
]

Note that the first object in the JSON array is metadata associated with snoospider.
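
If you are consuming these files in Node.js, you can separate that metadata from the submissions before processing (a minimal sketch; the filename is a placeholder following the pattern above):

'use strict';

const FS = require('fs');

// Placeholder filename following the {subreddit}-{Unix ms} pattern.
const OUTPUT = JSON.parse(FS.readFileSync('./funny-1436369760000', 'utf8'));

// The first element is snoospider metadata; the rest are submissions.
const METADATA = OUTPUT[0];
const SUBMISSIONS = OUTPUT.slice(1);

console.log(METADATA.version, SUBMISSIONS.length);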