

Puppet Scrap

Puppet Scrap is a Node.js application that scrapes data from web pages using Puppeteer, driven by a JSON dataset and a JSON Path query that selects which items in the dataset to process. It takes the following inputs:

  • Dataset in JSON format: The dataset contains the data needed for scraping, such as URLs or other relevant information, and serves as the data source for the scraping process (a sample dataset is shown after the script example below).

  • JSON Path Query: A JSON Path query selects specific items from the dataset. The selected items are the data points for which the scraping process will be executed, so you can target exactly the elements you want to scrape.

  • Scraping Script: A JavaScript file that performs the actual scraping. The script receives two arguments: a Puppeteer page object and the dataset. It is responsible for navigating to the specified URLs, extracting the required data from the pages, and updating the dataset with the scraped information. For example:

    export default async function (page, dataset) {
      // Navigate to the URL from the dataset and wait for the list to render.
      await page.goto(dataset.url);
      await page.waitForSelector('li');

      // Collect every list link on the page as a { name, url } pair.
      dataset.products = await page.evaluate(() => {
        const allElements = Array.from(document.querySelectorAll('li a'));
        return allElements.map((e) => ({ name: e.innerHTML, url: e.href }));
      });

      return dataset;
    }
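
To make these inputs concrete, here is a hypothetical dataset the script above could run against (the shape is entirely up to you and your script; these URLs are placeholders):

    [
      { "url": "http://localhost:3456/category/books" },
      { "url": "http://localhost:3456/category/games" }
    ]

With the JSON Path query $[*], each object in the array is selected as a data point; assuming the example script above, each one would need a url field and would gain a products field after scraping.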

Usage

To display Puppet Scrap's usage instructions, run:

puppet-scrap.js --help

How it works

  • It parses the command-line arguments provided by the user, such as the dataset file path, scraping script file path, JSON Path query, output file path, and other optional parameters like delay, limit, and pretty output.
  • If there is a progress file stored (.${projectName}.progress.json), indicating previous scraping progress, it will be loaded to resume the scraping process.
  • The tool reads the dataset from the specified JSON file and parses it.
  • The scraping script is loaded from the provided file path. This script will be used to extract data from web pages.
  • A JSON Path query is executed on the dataset based on the user-provided query. This query identifies the data points for which the scraping script will be applied.
  • The headless Chrome browser is launched using Puppeteer.
  • For each data point obtained from the JSON Path query and within the specified limit, the scraping process is executed.
  • The progress is updated, and the tool stores the current progress in a progress file (.${projectName}.progress.json).
  • After scraping all the data points or reaching the specified limit, the browser is closed, and the progress file is deleted once the run is complete. A simplified sketch of this flow follows below.
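
Putting those steps together, the main loop looks roughly like the following. This is an illustrative sketch, not the actual source: the jsonpath-plus library, the progress-file format, and all names below are assumptions.

    // Simplified sketch of the main loop — illustrative, not the actual source.
    import fs from 'node:fs';
    import puppeteer from 'puppeteer';
    import { JSONPath } from 'jsonpath-plus'; // assumed JSON Path library

    const args = {                            // stand-in for the parsed CLI arguments
      dataset: './data/products_1.json',
      script: './scripts/list.js',
      query: '$[*]',
      output: './data/products_2.json',
      delay: 500,                             // optional pause between data points (ms)
    };
    const progressFile = '.puppet-scrap.progress.json'; // assumed progress format

    const dataset = JSON.parse(fs.readFileSync(args.dataset, 'utf8'));
    const scrape = (await import(args.script)).default;
    const items = JSONPath({ path: args.query, json: dataset });

    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    for (const [i, item] of items.entries()) {
      await scrape(page, item);               // the user script mutates the matched item
      fs.writeFileSync(progressFile, JSON.stringify({ next: i + 1 }));
      if (args.delay) await new Promise((r) => setTimeout(r, args.delay));
    }

    await browser.close();
    fs.writeFileSync(args.output, JSON.stringify(dataset, null, 2));
    fs.unlinkSync(progressFile);              // scraping complete: remove progress file

Because the JSON Path matches are references into the parsed dataset, mutating each matched item inside the scraping script also updates the dataset that is written to the output file.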

Example

Using Puppet Scrap with the provided example:

puppet-scrap.js --dataset ./data/products_1.json --script ./scripts/list.js --query '$[*]' --output ./data/products_2.json

This will execute Puppet Scrap on the ./data/products_1.json dataset file. It will apply the scraping script located at ./scripts/list.js to each data point selected by the JSON Path query $[*]. The scraped information will be updated in the dataset, and the final dataset will be saved to ./data/products_2.json.

To demonstrate how Puppet Scrap works, we have provided an example folder containing a demo that you can run. The demo includes a dummy website that simulates a product catalog. Here's how you can run the example:

  • Start the dummy website by running the following command at the root of the Puppet Scrap project:
      npm run example
  • This will start the product catalog at http://localhost:3456
  • Go to the example folder using the following command and run the scrap.sh script to execute Puppet Scrap on the example dataset:
      cd ./example
      ./scrap.sh
  • Puppet Scrap will use the provided dataset in JSON format, perform the multi-step scraping, and update the dataset with the scraped information. The output will be saved to the ./data folder; see the source of scrap.sh for details (a hypothetical sketch of such a multi-step run follows below).
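
A multi-step run chains one output file into the next step's input. The following is only a hypothetical sketch of what scrap.sh might do — the detail.js script name and the query paths are invented for illustration; consult the real scrap.sh in the example folder:

    #!/bin/sh
    # Hypothetical two-step pipeline; see the actual example/scrap.sh for the real steps.
    ../puppet-scrap.js --dataset ./data/products_1.json --script ./scripts/list.js \
      --query '$[*]' --output ./data/products_2.json
    ../puppet-scrap.js --dataset ./data/products_2.json --script ./scripts/detail.js \
      --query '$[*].products[*]' --output ./data/products_3.json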

Important Notes

  • Before running Puppet Scrap, ensure you have Node.js installed on your system.
  • Always be respectful of websites' terms of service and consider adding delays between requests (the delay option exists for this) to avoid overloading servers.
  • Make sure your scraping activities comply with legal and ethical guidelines, respecting the website owners' policies.
  • Test your scripts thoroughly and be prepared to handle various edge cases and errors gracefully.

With Puppet Scrap's power and flexibility, you can easily scrape data from dynamic web pages and use it for various purposes, from data analysis to building datasets for machine learning models. Enjoy exploring the endless possibilities!