jdistiller

v2.0.0

A page scraping DSL for extracting structured information from unstructured XHTML, built on Node.js and jQuery.

jDistiller

Author: @benjamincoe

Over my past couple of years in the industry, there have been several times when I have needed to scrape structured information from (relatively) unstructured XHTML websites.

My approach to doing this has gradually evolved to include the following technologies:

I was starting to notice a lot of code duplication in my scraping scripts. Enter jDistiller:

What is jDistiller?

  • jDistiller is a simple and powerful DSL for scraping structured information from XHTML websites.
  • it is built on jQuery and Node.js.
  • it grows out of my experiences, having built several one-off page scrapers.

Installation

npm install jdistiller

The DSL

  • first you create an instance of the jDistiller object:
var jDistiller = require('jdistiller').jDistiller;
new jDistiller()
  • the set() method is used to specify key/css-selector pairs to scrape data from:
new jDistiller()
	.set('headline', '#article h1.articleHeadline')
	.set('firstParagraph', '#article .articleBody p:eq(0)');
  • when the distill() method is called with a URL as input, a JavaScript object populated with the scraped data is passed to the callback.
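
The chaining pattern described in the list above can be sketched without any network access. The `MiniDistiller` constructor and the `fakePage` lookup table below are hypothetical stand-ins invented for illustration; they are not jDistiller's source, and the real library fetches and parses a page rather than reading from a plain object.

```javascript
// Hypothetical stand-in for illustration only; this is NOT jDistiller's source.
// It records key/selector pairs and resolves them against an in-memory
// lookup table instead of fetching and parsing a real page.
function MiniDistiller() {
  this.rules = [];
}

// set() stores the rule and returns `this` so that calls can be chained.
MiniDistiller.prototype.set = function (key, selector) {
  this.rules.push({ key: key, selector: selector });
  return this;
};

// distill() builds the result object and hands it to a Node-style callback.
// `page` is an assumed plain object mapping selectors to text.
MiniDistiller.prototype.distill = function (page, callback) {
  const result = {};
  this.rules.forEach(function (rule) {
    result[rule.key] = page[rule.selector];
  });
  callback(null, result);
};

// Assumed sample data standing in for a parsed page.
const fakePage = {
  '#article h1.articleHeadline': 'Example headline',
  '#article .articleBody p:eq(0)': 'Example first paragraph.'
};

let distilled;
new MiniDistiller()
  .set('headline', '#article h1.articleHeadline')
  .set('firstParagraph', '#article .articleBody p:eq(0)')
  .distill(fakePage, function (err, distilledPage) {
    distilled = distilledPage;
    console.log(JSON.stringify(distilledPage));
  });
```

Returning `this` from set() is what lets the rules read as one fluent chain; the result only exists once distill() invokes the callback.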

Simple Example (New York Times)

var jDistiller = require('jdistiller').jDistiller;

new jDistiller()
	.set('headline', '#article h1.articleHeadline')
	.set('firstParagraph', '#article .articleBody p:eq(0)')
	.distill('http://www.nytimes.com/2012/09/09/us/politics/obama-and-romney-battle-for-votes-in-2-swing-states.html?_r=1&hp', function(err, distilledPage) {
		console.log(JSON.stringify(distilledPage))
	});

Output

{"headline":"Obama Tries to Turn Focus to Medicare From Jobs Figures","firstParagraph":"SEMINOLE, Fla. — President Obama on Saturday began hammering away at the Republican ticket’s plans for Medicare, using a campaign swing through Florida, with its large number of retired and elderly voters, to try to turn the page from anemic employment growth, his biggest weakness, to entitlements, a Democratic strength."}

An Optional Closure can be Provided for Processing the Value

A closure can optionally be provided as the third parameter for the set() method.

If a closure is given, the return value of the closure will be set as a key's value, rather than the text value of the selector.

DSL Using an Optional Data Processing Closure

var jDistiller = require('jdistiller').jDistiller;

new jDistiller()
	.set('headline', '#article h1.articleHeadline')
	.set('firstParagraph', '#article .articleBody p:eq(0)')
	.set('image', '#article .articleBody .articleSpanImage img', function(element, prev) {
		return element.attr('src')
	})
	.distill('http://www.nytimes.com/2012/09/09/us/politics/obama-and-romney-battle-for-votes-in-2-swing-states.html?_r=1&hp', function(err, distilledPage) {
		console.log(JSON.stringify(distilledPage))
	});

Output

{"headline":"Obama Tries to Turn Focus to Medicare From Jobs Figures","firstParagraph":"SEMINOLE, Fla. — President Obama on Saturday began hammering away at the Republican ticket’s plans for Medicare, using a campaign swing through Florida, with its large number of retired and elderly voters, to try to turn the page from anemic employment growth, his biggest weakness, to entitlements, a Democratic strength.","image":"http://graphics8.nytimes.com/images/2012/09/09/us/JP-CANDIDATE-1/JP-CANDIDATE-1-articleLarge.jpg"}

The closure will be passed the following values:

  • element: a jQuery element matching the CSS selector specified in set().
  • prev: if multiple elements on the page match the selector, the closure will be executed once for each. prev can be used to interact with the object created by previous executions of the closure. As an example, we might want to increment a counter if the same link occurs multiple times on the same page.
  • this: the state is shared between multiple executions of the same closure (see examples/wikipedia.js, to get an idea of why this is useful).
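
To make the roles of prev and this concrete, here is a rough sketch of how a closure might be driven once per matched element. The `runClosure` helper is hypothetical and is not jDistiller's implementation; plain strings stand in for jQuery elements, and the way return values are folded into prev is an assumption made for the example.

```javascript
// Hypothetical illustration; NOT jDistiller's implementation.
// `elements` stands in for the jQuery elements matched by a selector
// (plain strings here, which is an assumption for the sketch).
function runClosure(elements, closure) {
  const state = {}; // shared `this` across every invocation
  let prev = {};    // object built up from earlier return values
  elements.forEach(function (element) {
    const returned = closure.call(state, element, prev);
    if (returned !== undefined) {
      prev = Object.assign({}, prev, returned);
    }
  });
  return prev;
}

const links = ['/wiki/Gray_wolf', '/wiki/Canidae', '/wiki/Gray_wolf'];

const counts = runClosure(links, function (href, prev) {
  // State kept on `this` survives from one invocation to the next.
  this.seen = (this.seen || 0) + 1;
  const entry = {};
  // Earlier results are read through `prev`.
  entry[href] = (prev[href] || 0) + 1;
  entry.total = this.seen;
  return entry;
});

console.log(JSON.stringify(counts));
```

In this sketch prev carries the accumulated result, while this carries scratch state that never appears in the output unless the closure returns it.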

Closure Return Types

  • strings: the last string returned by the closure will be used as the value.
  • numbers: the last number returned by the closure will be used as the value.
  • arrays: when an array is returned, it will be merged with all other arrays returned for the given key. The final merged array will be set as value.
  • objects: when an object is returned, the object will be merged with all other objects returned. The final object will be used as the value.
  • key/object-pair: this special return type allows value to be populated with an object that has dynamically generated key names.
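
The merge rules listed above can be approximated with a small fold function. The `mergeReturn` helper below is hypothetical and is not taken from jDistiller's source; in particular, the test used to tell a key/object pair apart from an ordinary array (a two-item array whose first item is a string and whose second is a plain object) is an assumption.

```javascript
// Hypothetical helper, NOT taken from jDistiller's source. It folds one
// closure return value into the value accumulated so far for a key.
function mergeReturn(current, returned) {
  if (returned === undefined) return current;

  const isPlainObject = function (v) {
    return v !== null && typeof v === 'object' && !Array.isArray(v);
  };

  // Assumed detection rule for a key/object pair: [string, object].
  const isPair = Array.isArray(returned) &&
    returned.length === 2 &&
    typeof returned[0] === 'string' &&
    isPlainObject(returned[1]);

  if (isPair) {
    const withKey = Object.assign({}, isPlainObject(current) ? current : {});
    withKey[returned[0]] = returned[1];
    return withKey;
  }

  // Arrays are concatenated with everything returned earlier.
  if (Array.isArray(returned)) {
    return (Array.isArray(current) ? current : []).concat(returned);
  }

  // Objects are shallow-merged; later keys overwrite earlier ones.
  if (isPlainObject(returned)) {
    return Object.assign({}, isPlainObject(current) ? current : {}, returned);
  }

  // Strings and numbers: the last returned value wins.
  return returned;
}

// Fold a sequence of return values, starting from "nothing yet".
function fold(values) {
  return values.reduce(function (acc, value) {
    return mergeReturn(acc, value);
  }, undefined);
}

console.log(JSON.stringify(fold([['p1'], ['p2']])));
console.log(JSON.stringify(fold([['x', { n: 1 }], ['y', { n: 2 }]])));
```

The pair check runs before the array check because a pair is itself an array; a real implementation may well distinguish the two cases differently.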

Some Examples

Array Merging Example

var jDistiller = require('jdistiller').jDistiller;

new jDistiller()
	.set('paragraphs', '#article .articleBody p', function(element) {
		return [element.text()]
	})
	.distill('http://www.nytimes.com/2012/09/09/us/politics/obama-and-romney-battle-for-votes-in-2-swing-states.html?_r=1&hp', function(err, distilledPage) {
		console.log(JSON.stringify(distilledPage))
	});

Output

{"paragraphs": ["SEMINOLE, Fla. — President Obama on Saturday began hammering away at the Republican ticket’s...", "Kicking off a two-day bus tour through...", ...]}

Object Merging Example

var jDistiller = require('jdistiller').jDistiller;

new jDistiller()
	.set('headlines', '.mw-headline', function(element) {
		this.count = this.count || 0;
		this.count++;
		if (this.count === 2) {
			return {
				'second_heading': element.text().trim()
			}
		}
		if (this.count === 3) {
			return {
				'third_heading': element.text().trim()
			}
		}
	})
	.distill('http://en.wikipedia.org/wiki/Dog', function(err, distilledPage) {
		console.log(JSON.stringify(distilledPage));
	});

Output

{"headlines":{"second_heading":"Taxonomy","third_heading":"History and evolution"}}

Key/Object-Pair Example

var jDistiller = require('jdistiller').jDistiller;

new jDistiller()
	.set('links', '#bodyContent p a', function(element, prev) {
		var key = element.attr('href');
		return [key, {
			title: element.attr('title'),
			href: key,
			occurrences: prev[key] ? prev[key].occurrences + 1 : 1
		}]
	})
	.distill('http://en.wikipedia.org/wiki/Dog', function(err, distilledPage) {
		console.log(JSON.stringify(distilledPage));
	});

Output

{"links":{"#cite_note-MSW3_Lupus-1":{"title":"","href":"#cite_note-MSW3_Lupus-1","occurrences":1},"#cite_note-ADW-2":{"title":"","href":"#cite_note-ADW-2","occurrences":1},"/wiki/Gray_wolf_subspecies":{"title":"Gray wolf subspecies","href":"/wiki/Gray_wolf_subspecies","occurrences":1},"/wiki/Gray_wolf":{"title":"Gray wolf","href":"/wiki/Gray_wolf","occurrences":1},"/wiki/Canidae":{"title":"Canidae","href":"/wiki/Canidae","occurrences":1}}}

That's About It

I'm excited about jDistiller; I think it solves the scraping problem in an elegant way.

Don't be shy with your feedback, and please contribute.

-- Ben @benjamincoe