

web-scraper

A simple web scraper for node.js using promises and css selectors.

How it works

All you need to do is define a JSON object with:

  • url: the web page URL to be scraped (mandatory)
  • forEach: the HTML element(s) where the scraper should search (optional). If not specified, html will be used.
  • get: an object specifying what data we want to get back from each forEach element.

Both the forEach element and the get values must be CSS selectors.

Briefly: for each forEach element found in the page's DOM, web-scraper will return an object with the same structure as the get param, with the scraped data in the corresponding properties. Let's see some examples.
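
For instance, a complete search object might look like this (a purely illustrative sketch: the URL and selectors below are hypothetical):

var mySearch = {
  url: 'https://example.com/articles',   // mandatory: the page to scrape
  forEach: 'article.post',               // optional: repeat the search inside every matching element
  get: {
    title: 'h2 a',                       // CSS selectors, resolved inside each forEach element
    summary: 'p.excerpt'
  }
};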

How to use it

Since web-scraper returns a Promise, given a correct search object (see examples below) you can choose between:

  • then/catch
const scraper = require('@jose.donas/web-scraper');
var mySearch = {...};

scraper(mySearch)
  .then(data => {
    // data is the info returned by scraper. Do whatever you need with it
  })
  .catch(err => {
    // Some error occurred. Handle it!
  });
  • async/await

A cleaner option, but remember that you can only use await inside an async function:

const scraper = require('@jose.donas/web-scraper');
var mySearch = {...};

(async () => {
  try {
    let data = await scraper(mySearch);
    // do whatever you need with data
  } catch (err) {
    // Some error occurred. Handle it!
  }
})();

Examples

In the following examples, we fetch some data from The Matrix page on IMDb.

1. Get the film name, available in the page title

Our search params will be:

var mySearch = {
  url: 'https://www.imdb.com/title/tt0133093/',
  get: {
    filmTitle: 'div#ratingWidget p strong'
  }
};

This will return:

{
    "filmTitle": "Matrix"
}

Notice that the result will be an array only if the search matches more than one element. This means you will need to check whether the result is a single element or a set of them. We'll see this in some of the following examples, but let's continue with The Matrix.
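
If you need your code to handle both cases, one defensive approach (just a sketch) is to normalize the value before using it:

scraper(mySearch).then(data => {
  // filmTitle may be a single string or an array of strings,
  // depending on how many elements matched the selector
  const titles = Array.isArray(data.filmTitle) ? data.filmTitle : [data.filmTitle];
  titles.forEach(title => console.log(title));
});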

2. Get the film name with different search params.

For every search, we can find several ways to get the desired data. In this case, we can get the film title with this configuration:

var mySearch = {
  url: 'https://www.imdb.com/title/tt0133093/',
  get: {
    filmTitle: 'strong'
  },
  forEach: 'div#ratingWidget p'
};

With this configuration we are asking our scraper to search for a strong element inside every p that is a child of div#ratingWidget. But in the inspected DOM this occurs only once, so again we will get:

{
    "filmTitle": "Matrix"
}

3. Get all the cast inside a single object.

Since there is a div with the id titleCast containing the whole cast, we can get all the names with:

var mySearch = {
  url: 'https://www.imdb.com/title/tt0133093/',
  get: {
    names: 'span[itemprop="name"]'
  },
  forEach: 'div#titleCast'
};

In this case, as there is just one element matching div#titleCast, we get just one object containing an array of names, one for each actor/actress:

{
    "names": [
        "Keanu Reeves",
        "Laurence Fishburne",
        "Carrie-Anne Moss",
        "Hugo Weaving",
        "Gloria Foster",
        "Joe Pantoliano",
        "Marcus Chong",
        "Julian Arahanga",
        "Matt Doran",
        "Belinda McClory",
        "Anthony Ray Parker",
        "Paul Goddard",
        "Robert Taylor",
        "David Aston",
        "Marc Aden Gray"
    ]
}

4. Get one object for each actor and actress, containing the character's name too.

What if we wanted to get a set of objects, each one containing the name of the actor/actress and the name of the character they played?

var mySearch = {
  url: 'https://www.imdb.com/title/tt0133093/',
  get: {
    name: 'span[itemprop="name"]',
    character: 'td.character'
  },
  forEach: 'div#titleCast table tbody tr.even, div#titleCast table tbody tr.odd'
};

This will return:

[
    {
        "name": "Keanu Reeves",
        "character": "Neo"
    },
    {
        "name": "Laurence Fishburne",
        "character": "Morpheus"
    },
    {
        "name": "Carrie-Anne Moss",
        "character": "Trinity"
    },
    {
        "name": "Hugo Weaving",
        "character": "Agent Smith"
    },
    {
        "name": "Gloria Foster",
        "character": "Oracle"
    },
    {
        "name": "Joe Pantoliano",
        "character": "Cypher"
    },
    {
        "name": "Marcus Chong",
        "character": "Tank"
    },
    {
        "name": "Julian Arahanga",
        "character": "Apoc"
    },
    {
        "name": "Matt Doran",
        "character": "Mouse"
    },
    {
        "name": "Belinda McClory",
        "character": "Switch"
    },
    {
        "name": "Anthony Ray Parker",
        "character": "Dozer"
    },
    {
        "name": "Paul Goddard",
        "character": "Agent Brown"
    },
    {
        "name": "Robert Taylor",
        "character": "Agent Jones"
    },
    {
        "name": "David Aston",
        "character": "Rhineheart"
    },
    {
        "name": "Marc Aden Gray",
        "character": "Choi (as Marc Gray)"
    }
]

5. Gathering links

Links ("a" elements) can be considered quite special because they contain a couple of elements we may want to store: the href attribute and the anchor text. That's why when we ask our scraper to gather a elements, it'll return us both pieces of data. An example where we are interested in a set of links:

var mySearch = {
  url: 'https://www.imdb.com/title/tt0133093/',
  get: {
    linkToPerson: 'td[itemprop="actor"] a'
  },
  forEach: 'div#titleCast table tbody tr'
};

This will return:

[
    {
        "linkToPerson": {
            "anchorText": "Keanu Reeves",
            "href": "/name/nm0000206/?ref_=tt_cl_t1"
        }
    },
    {
        "linkToPerson": {
            "anchorText": "Laurence Fishburne",
            "href": "/name/nm0000401/?ref_=tt_cl_t2"
        }
    },
    {
        "linkToPerson": {
            "anchorText": "Carrie-Anne Moss",
            "href": "/name/nm0005251/?ref_=tt_cl_t3"
        }
    },
    {
        "linkToPerson": {
            "anchorText": "Hugo Weaving",
            "href": "/name/nm0915989/?ref_=tt_cl_t4"
        }
    },
    {
        "linkToPerson": {
            "anchorText": "Gloria Foster",
            "href": "/name/nm0287825/?ref_=tt_cl_t5"
        }
    },
    {
        "linkToPerson": {
            "anchorText": "Joe Pantoliano",
            "href": "/name/nm0001592/?ref_=tt_cl_t6"
        }
    },
    {
        "linkToPerson": {
            "anchorText": "Marcus Chong",
            "href": "/name/nm0159059/?ref_=tt_cl_t7"
        }
    },
    {
        "linkToPerson": {
            "anchorText": "Julian Arahanga",
            "href": "/name/nm0032810/?ref_=tt_cl_t8"
        }
    },
    {
        "linkToPerson": {
            "anchorText": "Matt Doran",
            "href": "/name/nm0233391/?ref_=tt_cl_t9"
        }
    },
    {
        "linkToPerson": {
            "anchorText": "Belinda McClory",
            "href": "/name/nm0565883/?ref_=tt_cl_t10"
        }
    },
    {
        "linkToPerson": {
            "anchorText": "Anthony Ray Parker",
            "href": "/name/nm0662562/?ref_=tt_cl_t11"
        }
    },
    {
        "linkToPerson": {
            "anchorText": "Paul Goddard",
            "href": "/name/nm0323822/?ref_=tt_cl_t12"
        }
    },
    {
        "linkToPerson": {
            "anchorText": "Robert Taylor",
            "href": "/name/nm0853079/?ref_=tt_cl_t13"
        }
    },
    {
        "linkToPerson": {
            "anchorText": "David Aston",
            "href": "/name/nm0040058/?ref_=tt_cl_t14"
        }
    },
    {
        "linkToPerson": {
            "anchorText": "Marc Aden Gray",
            "href": "/name/nm0336802/?ref_=tt_cl_t15"
        }
    }
]

6. Grouping data

Sometimes we may need to group the desired data. Let's see an example: in this case we are gathering the players from a random NBA game.

We could make a request to our scraper with these simple params:

var mySearch = {
  url: 'https://www.basketball-reference.com/boxscores/200911060CHA.html',
  get: {
    playerName: 'th[csk]',
    points: 'td[data-stat="pts"]'
  },
  forEach: 'table#box_atl_basic tbody tr, table#box_cha_basic tbody tr'
};

This is what we get:

[
    {
        "playerName": "Joe Johnson",
        "points": "13"
    },
    {
        "playerName": "Marvin Williams",
        "points": "9"
    },
    {
        "playerName": "Al Horford",
        "points": "10"
    },
    {
        "playerName": "Josh Smith",
        "points": "13"
    },
    {
        "playerName": "Mike Bibby",
        "points": "2"
    },
    {
        "playerName": "Jamal Crawford",
        "points": "13"
    },
    {
        "playerName": "Maurice Evans",
        "points": "10"
    },
    {
        "playerName": "Jeff Teague",
        "points": "4"
    },
    {
        "playerName": "Zaza Pachulia",
        "points": "8"
    },
    {
        "playerName": "Joe Smith",
        "points": "0"
    },
    {
        "playerName": "Randolph Morris",
        "points": "1"
    },
    {
        "playerName": "Jason Collins",
        "points": "0"
    },
    {
        "playerName": "Gerald Wallace",
        "points": "11"
    },
    {
        "playerName": "Boris Diaw",
        "points": "10"
    },
    {
        "playerName": "Raja Bell",
        "points": "24"
    },
    {
        "playerName": "Tyson Chandler",
        "points": "10"
    },
    {
        "playerName": "Raymond Felton",
        "points": "7"
    },
    {
        "playerName": "D.J. Augustin",
        "points": "11"
    },
    {
        "playerName": "Ronald Murray",
        "points": "15"
    },
    {
        "playerName": "Nazr Mohammed",
        "points": "8"
    },
    {
        "playerName": "Stephen Graham",
        "points": "2"
    },
    {
        "playerName": "Vladimir Radmanovic",
        "points": "2"
    },
    {
        "playerName": "Derrick Brown",
        "points": "2"
    },
    {
        "playerName": "Gerald Henderson",
        "points": "1"
    }
]

Yes, this way we get all the players with their respective points, but which team did each of them play for?

To get that info grouped by team, since each roster has its own table, we just need to pass an array instead of a comma-separated list of selectors. In this example, note the change in the forEach field:

var mySearch = {
  url: 'https://www.basketball-reference.com/boxscores/200911060CHA.html',
  get: {
    playerName: 'th[csk]',
    points: 'td[data-stat="pts"]'
  },
  forEach: ['table#box_atl_basic tbody tr', 'table#box_cha_basic tbody tr']
};

Now our scraper will search separately inside each of those elements (in this case, tables), so in the end we get the stats separated by team, where each array position contains one team's players and stats:

[
    [
        {
            "playerName": "Joe Johnson",
            "points": "13"
        },
        {
            "playerName": "Marvin Williams",
            "points": "9"
        },
        {
            "playerName": "Al Horford",
            "points": "10"
        },
        {
            "playerName": "Josh Smith",
            "points": "13"
        },
        {
            "playerName": "Mike Bibby",
            "points": "2"
        },
        {
            "playerName": "Jamal Crawford",
            "points": "13"
        },
        {
            "playerName": "Maurice Evans",
            "points": "10"
        },
        {
            "playerName": "Jeff Teague",
            "points": "4"
        },
        {
            "playerName": "Zaza Pachulia",
            "points": "8"
        },
        {
            "playerName": "Joe Smith",
            "points": "0"
        },
        {
            "playerName": "Randolph Morris",
            "points": "1"
        },
        {
            "playerName": "Jason Collins",
            "points": "0"
        }
    ],
    [
        {
            "playerName": "Gerald Wallace",
            "points": "11"
        },
        {
            "playerName": "Boris Diaw",
            "points": "10"
        },
        {
            "playerName": "Raja Bell",
            "points": "24"
        },
        {
            "playerName": "Tyson Chandler",
            "points": "10"
        },
        {
            "playerName": "Raymond Felton",
            "points": "7"
        },
        {
            "playerName": "D.J. Augustin",
            "points": "11"
        },
        {
            "playerName": "Ronald Murray",
            "points": "15"
        },
        {
            "playerName": "Nazr Mohammed",
            "points": "8"
        },
        {
            "playerName": "Stephen Graham",
            "points": "2"
        },
        {
            "playerName": "Vladimir Radmanovic",
            "points": "2"
        },
        {
            "playerName": "Derrick Brown",
            "points": "2"
        },
        {
            "playerName": "Gerald Henderson",
            "points": "1"
        }
    ]
]
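
As a quick illustration of how this grouped result can be consumed (just a sketch; which index corresponds to which team simply follows the order of the selectors in forEach), you could total each team's points like this:

const scraper = require('@jose.donas/web-scraper');

(async () => {
  const teams = await scraper(mySearch);   // mySearch is the grouped search object above
  teams.forEach((players, index) => {
    // points come back as strings, so convert them before summing
    const total = players.reduce((sum, player) => sum + Number(player.points), 0);
    console.log(`Team ${index + 1} scored ${total} points`);
  });
})();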

Check out the test folder to find more examples.

Tests

You can run the tests by executing

npm test

from the console (after setting your current directory to the project root). After the tests run, you'll have access to a coverage report both in the terminal and in the coverage folder (created automatically), which contains an HTML report (simply double-click index.html for a more detailed one).

Installation

Installation is quite easy using npm:

npm i @jose.donas/web-scraper

You can find some extra info about this module on the web-scraper npm page.

Author

  • Jose Antonio González Doñas - LinkedIn

License

This project is licensed under the Apache 2.0 License - see the LICENSE.md file for details