
jedi-crawler

v0.0.3

Lightsabing Node/PhantomJS crawler. Crawl almost everything, including AJAX content.

JEDI CRAWLER

Da fuq?

JEDI CRAWLER is a Node/PhantomJS crawler made to scrape pretty much anything from Node, with a really simple syntax. Work in progress, ladies.

How does it work

Register padawans with the Jedi crawler. Each padawan has a pattern to match URLs and a set of jQuery-style selectors. You can also post-process the data.

module.exports = function(jedi) {

  jedi.registerPadawan({
    // Pattern to match the URL (dots escaped so they only match a literal ".")
    pattern: /en\.wikipedia\.org\/wiki\//,
    // Selectors to be executed
    selectors: {
      title: {
        sel: "#firstHeading span",
        type: "text"
      },
      firstParagraph: {
        sel: "#toc ~ p:first",
        type: "text"
      }
    },
    // You can choose to process the data AFTER it has been crawled.
    postProcessing: function(data) {
      // Do your custom processing on the crawled data
      data.title = data.title.toUpperCase();
      return data;
    }
  });

};
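How the Jedi picks a padawan for a given URL isn't spelled out here. A minimal sketch, assuming padawans are tried in registration order and the first one whose pattern matches wins. The padawans array, findPadawan, and the name field are hypothetical and not part of the module:

```javascript
// Hypothetical registry; the real module keeps its own internal state.
const padawans = [];

function registerPadawan(padawan) {
  padawans.push(padawan);
}

// Assumption: the first registered padawan whose pattern matches the URL wins.
function findPadawan(url) {
  for (const padawan of padawans) {
    if (padawan.pattern.test(url)) {
      return padawan;
    }
  }
  return null; // no padawan knows how to crawl this URL
}

// "name" is only here to make the output readable.
registerPadawan({ name: "wikipedia", pattern: /en\.wikipedia\.org\/wiki\// });
registerPadawan({ name: "catch-all", pattern: /^https?:\/\// });

console.log(findPadawan("http://en.wikipedia.org/wiki/Node.js").name); // wikipedia
console.log(findPadawan("http://example.com/").name); // catch-all
console.log(findPadawan("ftp://example.com/")); // null
```

If dispatch really works this way, register your most specific padawans first.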

For now, only two selector types are supported: "text" and "src".
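As a sketch of the "src" type, here is a hypothetical padawan for an image gallery. The site, the selectors, and the fakeJedi stand-in are made up for illustration; only registerPadawan and the pattern/selectors/sel/type fields come from the example above. It also assumes "src" returns the src attribute of each matched element:

```javascript
// Hypothetical padawan: example.com and its selectors are made up.
function galleryPadawan(jedi) {
  jedi.registerPadawan({
    pattern: /example\.com\/gallery\//,
    selectors: {
      caption: {
        sel: "h1.caption",
        type: "text"
      },
      // Assumption: "src" reads the src attribute of the matched elements
      photos: {
        sel: "img.photo",
        type: "src"
      }
    }
  });
}

// Stand-in for the real jedi module so this snippet runs on its own.
const registered = [];
const fakeJedi = {
  registerPadawan: function(padawan) {
    registered.push(padawan);
  }
};

galleryPadawan(fakeJedi);
console.log(registered[0].selectors.photos.type); // src
```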

I find having one file per padawan (crawler) pretty cool for code clarity. Also, padawans need to learn by themselves and be alone.

You can then give your padawans to the Jedi by doing:

var jedi = require('./modules/jedi');
require('./padawans/wikipedia')(jedi);

And then you can do:

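// The URL has to match a registered padawan's pattern (here, .../wiki/...)
jedi.crawl('http://en.wikipedia.org/wiki/whatever', function(err, result){
  if (err) { return console.error(err); }
  console.log(result);
});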

Special features

Crawlers start scraping the page only once $(document).ready has fired. Our own version of jQuery is injected into the page, but we then give $ back to its owner, in case the page runs third-party libraries that modify the DOM or whatever.
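The idea of giving $ back can be sketched without a browser. This is a simulation under assumptions, not the module's actual code: borrowDollar, fakeWindow, and both stand-in functions are hypothetical. With real jQuery, jQuery.noConflict() does this job.

```javascript
// Hypothetical helper: put our jQuery on the page and remember what was there.
function borrowDollar(win, ourJquery) {
  const hadDollar = Object.prototype.hasOwnProperty.call(win, "$");
  const previous = win.$;
  win.$ = ourJquery;

  // Call this to hand $ back to whoever owned it before.
  return function giveBack() {
    if (hadDollar) {
      win.$ = previous;
    } else {
      delete win.$;
    }
    return ourJquery; // we keep our own reference for running selectors
  };
}

// Fake page whose $ belongs to some other library.
const fakeWindow = { $: function thirdPartyDollar() { return "theirs"; } };
const ourJquery = function() { return "ours"; };

const giveBack = borrowDollar(fakeWindow, ourJquery);
console.log(fakeWindow.$()); // ours

const jq = giveBack();
console.log(fakeWindow.$()); // theirs
console.log(jq()); // ours
```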

If a selector matches several DOM elements, an array of all the values is returned.
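Since a field can come back as either a single value or an array, code that consumes the result may want to normalize it. A small sketch; toArray is a hypothetical helper, and treating a missing value as an empty list is an assumption, since what a selector with no matches returns is not stated here:

```javascript
// Hypothetical helper: always hand back an array, whatever the crawler returned.
function toArray(value) {
  // Assumption: a selector with no match yields null or undefined.
  if (value === null || value === undefined) {
    return [];
  }
  return Array.isArray(value) ? value : [value];
}

console.log(toArray("Node.js").length); // 1
console.log(toArray(["a.png", "b.png"]).length); // 2
console.log(toArray(undefined).length); // 0
```

You could call this from a padawan's postProcessing function so the rest of your code only ever deals with arrays.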

Right now, PhantomJS is instantiated with the "--load-images=no" option so pages load faster.
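For reference, this is roughly what launching PhantomJS with that option looks like from Node. It is a sketch under assumptions: the module may well start PhantomJS differently, and phantomArgs, launchPhantom, and the crawl.js script name are hypothetical.

```javascript
const { spawn } = require("child_process");

// PhantomJS options go before the script path; script arguments go after it.
function phantomArgs(scriptPath, url) {
  return ["--load-images=no", scriptPath, url];
}

// Assumption: PhantomJS runs as a child process and the phantomjs binary
// is on your PATH. This function is defined but not called here.
function launchPhantom(scriptPath, url) {
  return spawn("phantomjs", phantomArgs(scriptPath, url));
}

console.log(phantomArgs("crawl.js", "http://en.wikipedia.org/wiki/Node.js").join(" "));
// --load-images=no crawl.js http://en.wikipedia.org/wiki/Node.js
```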

Test it now

  • Pull that bad boy

  • Make sure you have PhantomJS installed

  • Run node main.js