
grunt-scrape

v0.1.1


grunt-scrape v0.1.0

A Grunt wrapper around the node-scrape plugin. Allows you to scrape web pages and extract collections of data.

Getting Started

This plugin requires Grunt ~0.4.5

If you haven't used Grunt before, be sure to check out the Getting Started guide, as it explains how to create a Gruntfile as well as install and use Grunt plugins. Once you're familiar with that process, you may install this plugin with this command:

npm install grunt-scrape --save-dev

Once the plugin has been installed, it may be enabled inside your Gruntfile with this line of JavaScript:

grunt.loadNpmTasks('grunt-scrape');

scrape task

Run this task with the grunt scrape command.

Task targets, files and options may be specified according to the grunt Configuring tasks guide.

Usage

The minimal setup for grunt-scrape is a scrape task with the fields src, dest and collections.

grunt.initConfig({
  scrape: {
    mydata: {
      src: 'http://example.com',
      dest: 'tmp/data.json',
      collections: [{
        name: 'mydata',
        group: '#someid > .some-class > table tr',
        elements: {
          name: {
            query: '> td > a'
          },
          link: {
            query: '> td > a',
            attr: 'href'
          }
        }
      }]
    }
  }
});

Options

Collections (collections {array(object)})

One or more collections to extract from the website. Each collection will produce an object containing the fields specified under the elements option.

Settings:

name {string}

The name of the collection

elements {array(object)}

Specifies the fields of data to extract. Options are:

query {string}

A jQuery selector that identifies the field within the page or current group.

attr {string} (optional)

Specifies the data to extract. If no attr is specified, data is extracted using jQuery's .text() method, which strips any HTML tags still contained in the node.

If you want to extract the raw HTML instead, use attr: 'html'.

All common HTML attributes are available, e.g., class, href, src, ...
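As a sketch (element names and selectors here are illustrative, not taken from the plugin docs), the three common cases look like this:

```javascript
elements: {
  title:   { query: 'h1' },                    // default: text content via .text()
  link:    { query: 'a.more', attr: 'href' },  // an HTML attribute
  content: { query: '.body', attr: 'html' }    // raw inner HTML
}
```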

filter {string|function} (optional)

Allows you to define a regular expression that further restricts the data extracted from a node.

Also accepts a callback function for custom filtering. Return null to exclude the element from the resulting dataset.
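A minimal sketch of a filter callback (the single-argument signature is assumed from the description above, not verified against the plugin source):

```javascript
// Hypothetical filter callback: trims the extracted text and
// drops empty entries by returning null (null excludes the element).
function trimOrSkip(data) {
  var trimmed = String(data).trim();
  return trimmed.length > 0 ? trimmed : null;
}
```

In an element definition this would be used as filter: trimOrSkip.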

format {string|function} (optional)

Specify a formatter to process the extracted data with. Presets are number (formats the data using Number(data)) and date (creates a new Date object).

Also accepts a callback function for custom formatting. Return null to exclude the element from the resulting dataset.
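A minimal sketch of a custom format callback (signature assumed as above), converting extracted text to an integer and excluding non-numeric entries:

```javascript
// Hypothetical format callback: parses the extracted text as a
// base-10 integer; returning null excludes non-numeric entries.
function toInteger(data) {
  var n = parseInt(data, 10);
  return isNaN(n) ? null : n;
}
```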

group {string}

A jQuery selector that identifies a grouped block of information on the page. All queries of a collection are run against this group.

Should be used to differentiate individual items of information, e.g., a table row within a table.

If no group is specified, items are grouped by index (this requires the same number of results for each element of the collection).
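For example (a sketch with illustrative selectors), a collection without group pairs the n-th result of each element query, so every query must match the same number of nodes:

```javascript
collections: [{
  name: 'links',          // no `group`: results are paired by index
  elements: {
    name: { query: '.nav a' },               // e.g. 3 matches
    link: { query: '.nav a', attr: 'href' }  // must also be 3 matches
  }
}]
```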

Source (src) and source parameters (params)

src {string|array(string)}

One or more urls of websites to scrape.

params {object} (optional)

Allows you to specify request parameters. Should be used when scraping multiple websites in one task. Request parameters are indicated in the src URL with a colon, e.g., :id

src: 'http://example.com/items/:id?param=:param',
params: {
  id: [123, 456],
  param: ['prime', 'secondary']
}

If multiple parameters are provided, all permutations of the given values will be scraped, i.e.:

http://example.com/items/123?param=prime
http://example.com/items/123?param=secondary
http://example.com/items/456?param=prime
http://example.com/items/456?param=secondary
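The expansion can be illustrated with a small helper (this mirrors the behaviour described above; it is not the plugin's own code):

```javascript
// Expands a URL template containing :name placeholders into all
// permutations of the given parameter values.
function expandUrls(template, params) {
  var urls = [template];
  Object.keys(params).forEach(function (key) {
    var next = [];
    urls.forEach(function (url) {
      params[key].forEach(function (value) {
        next.push(url.replace(':' + key, String(value)));
      });
    });
    urls = next;
  });
  return urls;
}
```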

Destination (dest)

dest {string}

The path to the output file. It must include a valid file extension, as the output format is determined by the path's file extension.

dest: 'path/to/my/file.json'

The path is resolved relative to the current working directory.

Currently supported file formats:

.json
.xml
.csv (only available for single collections)

Options (options)

Allows you to specify request options for accessing the target website (e.g., proxy, auth). For available options and usage, see request.
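A sketch of a task with request options (the proxy and auth fields follow the request library's documented options; the values are placeholders):

```javascript
scrape: {
  mydata: {
    src: 'http://example.com',
    dest: 'tmp/data.json',
    options: {
      proxy: 'http://proxy.example.com:8080',
      auth: { user: 'username', pass: 'password' }
    },
    collections: [ /* ... */ ]
  }
}
```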

Examples

Filter

...
<tr><td>Doe, John</td>...</tr>
<tr><td>Johnson, Jane</td>...</tr>
...

Assuming a format of Lastname, Firstname within a node, you can extract the first name and last name separately using filter regular expressions:

elements: {
  firstname: {
    query: '> td > a',
    filter: /,[ ]*(.*)/
  },
  lastname: {
    query: '> td > a',
    filter: /(.*)[ ]*,/
  }
}

will extract:

[{
  firstname: 'John',
  lastname: 'Doe',
},{
  firstname: 'Jane',
  lastname: 'Johnson',
}]

Format

...
<tr><td>Doe, John</td><td>24</td></tr>
<tr><td>Johnson, Jane</td><td>45</td></tr>
...
elements: {
  age: {
    query: '> td:nth-child(2)',
    format: 'number'
  }
}

format: 'number' will convert the data extracted by the query to a number before adding it to the dataset:

[{
  firstname: 'John',
  lastname: 'Doe',
  age: 24
},{
  firstname: 'Jane',
  lastname: 'Johnson',
  age: 45
}]

License

grunt-scrape is licensed under the MIT License.