web-scraper-js

v1.1.0

A lightweight and simple to use web scraping library.

A lightweight, no BS, simple-to-use web scraping library written for Node.js that simply does its job, nothing more, nothing less.

Disclaimer

Please make sure to use this package within legal and ethical boundaries.

Install

npm install --save web-scraper-js

Usage

You'll have to specify whether the data you want to scrape is (rendered) text or attribute values.

scrape([params])

This is the only functionality this package provides.

| parameter       | type   | description                                        | required |
|:----------------|:------:|:---------------------------------------------------|:--------:|
| url             | string | The URL to scrape                                  | true     |
| tags            | object | An object describing what to scrape                | true     |
| tags.text       | object | Text elements to be scraped                        | false    |
| tags.attribute  | object | Attributes of elements to be scraped               | false    |
| tags.singleton  | object | Category of content that only occurs once          | false    |
| tags.collection | object | Category of content that can occur multiple times  | false    |

To scrape anything you'll have to provide selectors. Since you're reading this, I'll assume you know what those are and move on.

The tags.text and tags.attribute objects take different key value pairs.

tags.text

This object takes key-value pairs of the form {'name': 'selector'}, where the key is a name of your choice. Pick a meaningful name, since you will use it to access the data in the response. Also, as of now you should not declare the same name in both tags.text and tags.attribute, since one will overwrite the other.

The selector part should be obvious; browsers typically let you copy selectors straight from their dev tools.
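Because a name declared in both objects gets overwritten, it can help to check for clashes before scraping. The following is only a sketch: `findDuplicateNames` is a hypothetical helper, not part of web-scraper-js, and the `h1` selector is made up for illustration.

```javascript
// Hypothetical helper (not provided by web-scraper-js): lists the names that
// are declared in both tags.text and tags.attribute.
function findDuplicateNames(tags) {
    const textNames = Object.keys(tags.text || {});
    const attributeNames = new Set(Object.keys(tags.attribute || {}));
    return textNames.filter((name) => attributeNames.has(name));
}

const tags = {
    text: {
        "movie-title": "h1", // assumed selector, for illustration only
        "movie-character": ".character a"
    },
    attribute: {
        "movie-title": ["meta[property='og:title']", "content"]
    }
};

const duplicates = findDuplicateNames(tags);
console.log(duplicates); // [ 'movie-title' ]
```

If the list is not empty, renaming one of the clashing keys (for example to "movie-title-text") keeps both values in the response.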

tags.attribute

This object takes key-value pairs of the form {'name': ['selector', 'attribute']}.

This time the value is a tuple containing the selector and the attribute from which to collect data. Since an element can have multiple attributes, you'll have to declare which one to use, simple as that.
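A quick way to catch a malformed entry is to check that every value really is a two-element array of strings. This is a minimal sketch: `findInvalidAttributeTags` is a hypothetical helper, not part of web-scraper-js, and the "movie-poster" entry with its selector is invented to show a failing case.

```javascript
// Hypothetical helper (not provided by web-scraper-js): returns the names
// whose value is not a [selector, attribute] pair of strings.
function findInvalidAttributeTags(attribute) {
    return Object.entries(attribute)
        .filter(([, value]) =>
            !Array.isArray(value) ||
            value.length !== 2 ||
            value.some((part) => typeof part !== 'string'))
        .map(([name]) => name);
}

const attribute = {
    "movie-title": ["meta[property='og:title']", "content"],
    "movie-actor": [".primary_photo > a > img", "alt"],
    "movie-poster": ".poster img" // invalid on purpose: the attribute name is missing
};

const invalid = findInvalidAttributeTags(attribute);
console.log(invalid); // [ 'movie-poster' ]
```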

tags.collection and tags.singleton

These are objects to provide more meaning to the search, where singleton tells the code that these items should only occur once, whereas collection means there might be multiple entries of the same structure.

Both contain objects structured like the previous tags.text and tags.attribute objects, as you can see in the examples.

Examples

The following examples scrape a couple of details about the movie Pulp Fiction from IMDb.

let webscraper = require('web-scraper-js');

(async () => {
    
    let result = await webscraper.scrape({
        url: 'https://www.imdb.com/title/tt0110912/',
        tags: {
            text: {
                "movie-rating-value": 'span[itemprop="ratingValue"]',
                "movie-character": ".character a"
            },
            attribute: {
                "movie-title": ["meta[property='og:title']", "content"],
                "movie-actor": [".primary_photo > a > img", "alt"]
            }
        }
    });

    console.log(result);
})();

The code above will print the following output:

{
  "movie-rating-value": [ "8.9" ],
  "movie-character": [
     "Pumpkin", "Honey Bunny", "Waitress", //...
   ],
  "movie-title": [ "Pulp Fiction (1994) - IMDb" ],
  "movie-actor": [
     "Tim Roth", "Amanda Plummer", "Laura Lovelace", //...
   ]
}

As you can see, the result is a simple object keyed by the names you declared. Each value is an array holding the results of the respective selector, since one selector can match multiple elements.

There is also a more semantic way to declare the contents you want scraped. With this method you declare whether the respective elements should occur just once (singleton) or whether there might be more than one element of the same kind (collection).
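Since every value is an array, single values have to be unpacked by hand. The sketch below works on a copy of the output shown above, cut to the three entries that are printed. Pairing actors with characters by position is an assumption that both selectors return their matches in the same order.

```javascript
// Sample data copied from the output above (truncated to the visible entries).
const result = {
    "movie-rating-value": ["8.9"],
    "movie-character": ["Pumpkin", "Honey Bunny", "Waitress"],
    "movie-title": ["Pulp Fiction (1994) - IMDb"],
    "movie-actor": ["Tim Roth", "Amanda Plummer", "Laura Lovelace"]
};

// Values that occur once still arrive as one-element arrays.
const [title] = result["movie-title"];
const rating = Number(result["movie-rating-value"][0]);

// Assumption: both lists are in the same order, so index i of one list
// belongs to index i of the other.
const cast = result["movie-actor"].map((actor, i) => ({
    actor,
    character: result["movie-character"][i]
}));

console.log(title, rating);
console.log(cast[0]); // { actor: 'Tim Roth', character: 'Pumpkin' }
```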

let webscraper = require('web-scraper-js');

(async () => {
    
    let result = await webscraper.scrape({
        url: 'https://www.imdb.com/title/tt0110912/',
        tags: {
            singleton: {
                text: {
                    "movie-rating-value": 'span[itemprop="ratingValue"]'
                },
                attribute: {
                    "movie-title": ["meta[property='og:title']", "content"]
                }
            },
            collection: {
                text: {
                    "movie-character": ".character a"
                },
                attribute: {
                    "movie-actor": [".primary_photo > a > img", "alt"]
                }
            }
        }
    });

    console.log(result);

})();

Elements declared as singleton are returned as a single value, while collection elements are returned as an array.

{
  "movie-rating-value": "8.9",
  "movie-character": [
     "Pumpkin", "Honey Bunny", "Waitress", //...
   ],
  "movie-title": "Pulp Fiction (1994) - IMDb",
  "movie-actor": [
     "Tim Roth", "Amanda Plummer", "Laura Lovelace", //...
   ]
}
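To make the difference between the two result shapes concrete, here is a sketch that converts the flat, array-only result of the first example into the singleton/collection shape. `applyGrouping` is a hypothetical helper written for illustration. It is not how web-scraper-js works internally, and taking the first match for a singleton is an assumption.

```javascript
// Hypothetical helper: unwraps the values of the given singleton names and
// leaves all other values as arrays.
function applyGrouping(flatResult, singletonNames) {
    const grouped = {};
    for (const [name, values] of Object.entries(flatResult)) {
        // Assumption: a singleton keeps only the first match.
        grouped[name] = singletonNames.includes(name) ? values[0] : values;
    }
    return grouped;
}

const flat = {
    "movie-rating-value": ["8.9"],
    "movie-character": ["Pumpkin", "Honey Bunny", "Waitress"],
    "movie-title": ["Pulp Fiction (1994) - IMDb"],
    "movie-actor": ["Tim Roth", "Amanda Plummer", "Laura Lovelace"]
};

const grouped = applyGrouping(flat, ["movie-rating-value", "movie-title"]);
console.log(grouped["movie-title"]);     // Pulp Fiction (1994) - IMDb
console.log(grouped["movie-character"]); // [ 'Pumpkin', 'Honey Bunny', 'Waitress' ]
```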