
hammer-scrape v1.3.9

Unifies Cheerio and Puppeteer for a streamlined scraping experience

Downloads: 3

Hammer Scrape

Let's be honest here: web scraping isn't an art form. It's literally hitting the internet with a hammer. Do yourself a favor: if you can use a site-provided REST API to automate actions or gather your data, then do so. Your codebase will thank you.

BUT WAIT, I DON'T HAVE THAT LUXURY!

Great news: you've found a library that simplifies parsing and automating actions on web pages. Built on Cheerio and Puppeteer, this streamlined library takes much of that pain away.


Installing

npm install hammer-scrape

How does it work?

This library breaks the work down using what I call "cores" to handle the boilerplate and abstract away much of the functionality needed. From there it's up to each core to decide how to implement things such as querying the document or manipulating the web page. Currently you can find three cores inside this library:

  • CheerioParsing
  • PuppeteerParsing
  • PuppeteerManipulate

Cores are extensible, and you can develop your own using other frameworks such as Nightmare if that is your preference. Each of these cores powers what I call an "engine". An engine is made up of a parsing core and, where possible, a manipulating core. In this library you will find three different engines:

  • CheerioEngine: uses Cheerio for parsing, but has no manipulation capabilities
  • PuppeteerEngine: parsing and page manipulation done by Puppeteer
  • HammerEngine: uses both Cheerio and Puppeteer to provide a fast parser with page manipulation methods via Puppeteer
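The core/engine split described above can be sketched in plain TypeScript. Note that the interfaces and class names below are illustrative of the design only, not the library's actual type declarations:

```typescript
// Illustrative sketch of the core/engine design. These interfaces are
// hypothetical and do not match hammer-scrape's actual declarations.

// A parsing core knows how to query a document for data.
interface ParsingCore {
    getTextAll(selector: string): Promise<string[]>;
}

// A manipulating core knows how to act on a live page.
interface ManipulatingCore {
    click(selector: string): Promise<void>;
}

// An engine pairs a parsing core with an optional manipulating core.
class Engine {
    public constructor(
        private parser: ParsingCore,
        private manipulator?: ManipulatingCore,
    ) {}

    public parse(selector: string): Promise<string[]> {
        return this.parser.getTextAll(selector);
    }

    public canManipulate(): boolean {
        return this.manipulator !== undefined;
    }
}

// A toy in-memory core standing in for something like CheerioParsing.
class StaticCore implements ParsingCore {
    public constructor(private data: string[]) {}

    public getTextAll(_selector: string): Promise<string[]> {
        return Promise.resolve(this.data);
    }
}

// A CheerioEngine-style engine: parsing only, no manipulation core.
const engine = new Engine(new StaticCore(['README.md', 'package.json']));
engine.parse('a').then((items) => console.log(items));
console.log(engine.canManipulate()); // false
```

The point of the split is that an engine's consumers never care which framework backs a core; swapping Cheerio for Puppeteer (or Nightmare) only means handing the engine a different core.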

More often than not the most used engine will be HammerEngine, followed by CheerioEngine. I've included PuppeteerEngine just for the sake of completeness: everything PuppeteerEngine can do, HammerEngine can do, and anything CheerioEngine can do, HammerEngine can do. But the choice is yours if you have a preference, and, like the cores, you can always extend WebScrapingEngine to implement your own.

Why use Hammer over PuppeteerEngine?

The key is how Hammer works: the HammerEngine implementation will first attempt to use Cheerio to parse the document and look for what I call a "peek/ping selector". If it can find this selector, it will use Cheerio to parse the document, and only when it's time to manipulate the page will it create a Puppeteer request. This form of "lazy loading", so to speak, makes startup much faster and lighter on resources. If the selector is not found, a Puppeteer instance will be launched and shared between the cores.
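That fallback decision can be sketched in plain TypeScript. The names below are hypothetical and the selector check is reduced to a substring match; the library's internals will differ:

```typescript
// Hypothetical sketch of the peek/ping fallback, not hammer-scrape's
// actual internals. A cheap Cheerio-style check stands in for the peek:
// does the static HTML already contain the selector's target?
const cheapPeek = (html: string, selector: string): boolean =>
    html.includes(selector);

// Decide which backend to use for a fetched document.
function chooseBackend(html: string, peekSelector: string): 'cheerio' | 'puppeteer' {
    // If the peek selector is visible in the raw HTML, a lightweight
    // parser is enough; otherwise fall back to a full headless browser.
    return cheapPeek(html, peekSelector) ? 'cheerio' : 'puppeteer';
}

// Static page: the target is present in the raw markup.
console.log(chooseBackend('<div class="content">hi</div>', 'content')); // cheerio

// Dynamic page: markup is filled in by client-side JavaScript, so fall back.
console.log(chooseBackend('<div id="app"></div>', 'content')); // puppeteer
```

The win is that a Puppeteer instance (a whole Chromium process) is only ever launched when the cheap path fails or when page manipulation is actually requested.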


Example using Hammer

import HammerEngine from 'hammer-scrape';

async function main(): Promise<void> {
    console.time('Hammer benchmark');
    console.log('Starting up engine');
    const engine: HammerEngine = new HammerEngine('table.files tr.js-navigation-item td.content span a');
    await engine.startup();

    console.log('Now processing hammer-scrape repository');
    await engine.process('https://github.com/GabrieleNunez/hammer-scrape');

    // our goal is to scrape the file names from this repository
    let files: string[] = [];

    // parse the page and grab the data
    console.log('Parsing page');
    await engine.parse(async (core): Promise<void> => {
        files = await core.getTextAll('table.files tr.js-navigation-item td.content span a');
    });

    console.log('Top directory files');
    console.log(files);

    console.log('Shutting engine off');
    await engine.shutoff();
    console.timeEnd('Hammer benchmark');
}

main().then((): void => {
    console.log('Completed');
});

Can Hammer Scrape parse websites that are dynamic?

YES! This portion is powered by Puppeteer. If the peek/ping selector cannot be found, a Puppeteer instance will be created so that you can use a headless browser to interface with the site. All element manipulations are done using Puppeteer. If you are using the CheerioEngine you cannot manipulate a page; that's the nature of Cheerio. It's a parser, not a manipulation tool. Puppeteer can handle both.

You support Cheerio; can you load XML documents?

The capability is there, but the implementation is not hooked in at this moment. This will come very soon, in the next minor build.