
mega-scraper

scrape a website's content.

npm i -g mega-scraper

mega-scraper https://www.wikipedia.org

requirements

  • a running redis instance on host 0.0.0.0, port 6379 (see the note below for one way to start one)

  • on debian/ubuntu, install additional required libraries via sudo apt install -y gconf-service libasound2 libatk1.0-0 libc6 libcairo2 libcups2 libdbus-1-3 libexpat1 libfontconfig1 libgcc1 libgconf-2-4 libgdk-pixbuf2.0-0 libglib2.0-0 libgtk-3-0 libnspr4 libpango-1.0-0 libpangocairo-1.0-0 libstdc++6 libx11-6 libx11-xcb1 libxcb1 libxcomposite1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6 ca-certificates fonts-liberation libappindicator1 libnss3 lsb-release xdg-utils wget
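If you don't already have redis running locally, one quick way to start an instance on the default port is via Docker (this assumes Docker is installed; any other redis setup listening on 0.0.0.0:6379 works just as well):

docker run -d -p 6379:6379 redis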

api

see api.md for more usage examples and options.

e.g.

(async () => {
  // pull the browser and queue helpers from the package
  // (use require('mega-scraper') outside this repository)
  const { browser: { createBrowser, takeScreenshot }, queue: { createQueue } } = require('.')

  // start a browser and a named job queue
  const browser = await createBrowser()
  const queue = createQueue('wikipedia')

  // open a page and enqueue the url to scrape
  const page = await browser.newPage('https://www.wikipedia.org/')

  const url = 'https://www.wikipedia.org/'
  await queue.add({ url })

  // process each queued job: navigate, take a screenshot, read the page content
  queue.process(async (job) => {
    await page.goto(job.data.url)
    await takeScreenshot(page, job.data)
    const content = await page.content()
    console.log('content', content.substring(0, 500))
  })
})()

cli options

--headless [default: true]

set to false to run the scraper in "headful" mode (non-headless)

e.g.

mega-scraper https://www.wikipedia.org --headless false

--screenshot [default: true]

set to false to avoid taking a screenshot of each scraped page

e.g.

mega-scraper https://www.wikipedia.org --screenshot false

--proxy [default: true]

set to false to avoid proxying each request through a free proxy service (currently the get-free-https-proxy module is used)

e.g.

mega-scraper https://www.wikipedia.org --proxy false

--timeout [default: 5000]

set the timeout to the desired number of milliseconds (5000 = 5 seconds)

e.g.

mega-scraper https://www.wikipedia.org --timeout 10000

--images [default: true]

set to false to avoid loading images

e.g.

mega-scraper https://www.wikipedia.org --images false

--stylesheets [default: true]

set to false to avoid loading stylesheets

e.g.

mega-scraper https://www.wikipedia.org --stylesheets false

--javascript [default: true]

set to false to avoid loading javascript

e.g.

mega-scraper https://www.wikipedia.org --javascript false

--monitor [default: true]

set to false to avoid opening the web dashboard on localhost:4000

e.g.

mega-scraper https://www.wikipedia.org --monitor false

--exit [default: false]

set to true to exit the program with a success or failure status code once scraping is done.

e.g.

mega-scraper https://www.wikipedia.org --exit

--cookie [default: none]

set to a desired cookie to further reduce the chance of being detected

e.g.

mega-scraper https://www.wikipedia.org --cookie 'my=cookie'
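The options above can be combined in a single invocation. For example, the following (using only the flags documented here) runs the scraper headful, skips the free proxy, raises the timeout to 10 seconds, and exits once scraping is done:

mega-scraper https://www.wikipedia.org --headless false --proxy false --timeout 10000 --exit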