@satankebab/scraping-utils

v0.0.9


Scraping Utils

Set of utils and queues to make web scraping easy.

Features:

  • automatic retrying on errors
  • defining a minimum and maximum time a message may take to be processed
  • configurable parallelism
  • automatic initialization & cleanup of the Puppeteer browser (and pages)

Installation

npm install @satankebab/scraping-utils
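
Quick start

A minimal sketch of the basic (non-Puppeteer) flow, using only the API shown in the examples below (createQueue, queue.enqueue, a subscriber with next/error callbacks, and runBasicQueue). Treat it as a sketch of the intended usage rather than additional documented API; see the Examples section for the full walkthrough.

import { createQueue, runBasicQueue } from "@satankebab/scraping-utils"
import fetch from "node-fetch";

// The payload shape is user-defined; `attempt` mirrors the examples below
type Payload = {
  url: string
  attempt: number
}

const main = async () => {
  // Process at most 2 payloads at the same time
  const queue = createQueue<Payload>({ parallelLimit: 2 })
  queue.enqueue({ url: 'https://elm-lang.org/', attempt: 0 })

  await runBasicQueue({
    queue,
    crawler: {
      // Called once per payload
      next: async ({ url }: Payload) => {
        const response = await fetch(url)
        console.log(url, response.status)
      },
      // Called when a payload keeps failing
      error: (error: unknown, payload: Payload) => {
        console.error('Failed to process', payload.url, error)
      },
    },
    maxProcessingTime: 60 * 1000,
    minProcessingTime: 1000,
    retryAttempts: 2,
  })
}

main()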

Examples

With Puppeteer

import { Page, createQueue, PartialSubscriber, runPuppeteerQueue } from "@satankebab/scraping-utils"

const crawlSomething = async () => {
  // Define the shape of the data we want to store in our queue
  type Payload = {
    url: string
    attempt: number
  }
  
  // Create a queue
  const queue = createQueue<Payload>({
    parallelLimit: 2
  })
  
  // Add some initial items to the queue
  queue.enqueue({
    url: 'https://fsharpforfunandprofit.com/',
    attempt: 0,
  })
  queue.enqueue({
    url: 'https://github.com/gcanti/fp-ts',
    attempt: 0,
  })
  queue.enqueue({
    url: 'https://elm-lang.org/',
    attempt: 0,
  })
  queue.enqueue({
    url: 'https://github.com/Effect-TS/core',
    attempt: 0,
  })
  
  // Define our 'crawler' and the payload type it receives (the payload extended with a Puppeteer page)
  type PayloadWithPage = Payload & {
    page: Page,
  }
  const crawler: PartialSubscriber<PayloadWithPage, Payload> = {
    // For each payload, call this function
    next: async ({ page, url }) => {
      await page.goto(url);
      console.log('The first 100 chars from response:', (await page.content()).slice(0, 100))
      // You can call queue.enqueue here to add more items to crawl, for example:
      if (url === 'https://github.com/Effect-TS/core') {
        queue.enqueue({
          url: 'https://github.com/Effect-TS/monocle',
          attempt: 0,
        })
      }
    },
    // When there is an error, we want to be notified about it along with the original payload that caused the error.
    error: (error, payload) => {
      console.error(`Oops, could not process ${payload.url}, error: `, error, `, attempt: ${payload.attempt}`)
    },
  }
  
  // Consume payloads from the queue and resolve when the queue is empty
  await runPuppeteerQueue({
    queue,
    crawler,
    // Maximum time in ms the crawler's next function may run (crawling that takes too long is timed out)
    maxProcessingTime: 2 * 60 * 1000,
    // Minimum time in ms the crawler's next function should take (slows down crawling that is too fast)
    minProcessingTime: 5 * 1000,
    // How many times to retry the same payload before calling the crawler's .error method
    retryAttempts: 2,
    // Options that are directly passed to puppeteer's .launch method
    puppeteerLaunchOptions: { headless: false } 
  })

  console.log('We are done!')
}


crawlSomething()
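
Note that the page is never enqueued by you: the queue stores plain Payload objects, and runPuppeteerQueue appears to open a Puppeteer page for each payload and pass it to the crawler's next function, which is why the crawler is typed against PayloadWithPage (Payload & { page: Page }) while the queue itself only knows about Payload.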

Without Puppeteer

import { createQueue, PartialSubscriber, runBasicQueue } from "@satankebab/scraping-utils"
import fetch from "node-fetch";

const crawlSomething = async () => {
  // Define the shape of the data we want to store in our queue
  type Payload = {
    url: string
    attempt: number
  }
  
  // Create a queue
  const queue = createQueue<Payload>({
    parallelLimit: 2
  })
  
  // Add some initial items to the queue
  queue.enqueue({
    url: 'https://fsharpforfunandprofit.com/',
    attempt: 0,
  })
  queue.enqueue({
    url: 'https://github.com/gcanti/fp-ts',
    attempt: 0,
  })
  queue.enqueue({
    url: 'https://elm-lang.org/',
    attempt: 0,
  })
  queue.enqueue({
    url: 'https://github.com/Effect-TS/core',
    attempt: 0,
  })
  
  // Define our 'crawler'
  const crawler: PartialSubscriber<Payload, Payload> = {
    // For each payload, call this function
    next: async ({ url }) => {
      const response = await fetch(url);
      console.log('The first 100 chars from response:', (await response.text()).slice(0, 100))
      // You can call queue.enqueue here to add more items to crawl, for example:
      if (url === 'https://github.com/Effect-TS/core') {
        queue.enqueue({
          url: 'https://github.com/Effect-TS/monocle',
          attempt: 0,
        })
      }
    },
    // When there is an error, we want to be notified about it along with the original payload that caused the error.
    error: (error, payload) => {
      console.error(`Oops, could not process ${payload.url}, error: `, error, `, attempt: ${payload.attempt}`)
    },
  }
  
  // Consume payloads from the queue and resolve when the queue is empty
  await runBasicQueue({
    queue,
    crawler,
    // Maximum time in ms the crawler's next function may run (crawling that takes too long is timed out)
    maxProcessingTime: 2 * 60 * 1000,
    // Minimum time in ms the crawler's next function should take (slows down crawling that is too fast)
    minProcessingTime: 5 * 1000,
    // How many times to retry the same payload before calling the crawler's .error method
    retryAttempts: 2,
  })

  console.log('We are done!')
}


crawlSomething()
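
Retrying beyond retryAttempts

The attempt field in the payloads above is user-defined, and the README does not document any automatic handling of it, so the following is only one possible pattern (an assumption, not documented library behaviour): re-enqueue a failed payload from the error handler with the counter incremented, assuming queue.enqueue may also be called from there (the next handlers above already enqueue while the queue is running). A drop-in replacement for the crawler in the non-Puppeteer example:

  const crawler: PartialSubscriber<Payload, Payload> = {
    next: async ({ url }) => {
      const response = await fetch(url)
      console.log('Status:', response.status, url)
    },
    // Called after the library's own retryAttempts for this payload are exhausted
    error: (error, payload) => {
      if (payload.attempt < 3) {
        // Hypothetical manual retry: put the payload back with an incremented attempt counter
        queue.enqueue({ ...payload, attempt: payload.attempt + 1 })
      } else {
        console.error(`Giving up on ${payload.url} after attempt ${payload.attempt}:`, error)
      }
    },
  }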