
very-simple-scraper v0.2.0

Very Simple Scraper

Overview

This repo contains a scraper and a parser that both work on a specified input CSV file. First you scrape the rows in the file; then you parse the scraped response associated with each row. The parse command generates a parse session folder containing files for the extracted data, any missing data, and any new inputs that were discovered.
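
In practice that is two commands, both described in detail below: a scrape run followed by a parse run over the cached responses. The session name here is just an example.

node run.js scrape --token=MYTOKEN --input=input.csv
node run.js parse --input=input.csv --session=first_session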

Initial setup

npm i very-simple-scraper

// Pull in the scraper along with the domain definitions it should run against.
const scraper = require('very-simple-scraper').scraper;
const domains = require('./domains'); // see the illustrative sketch below

// Build the proxy URL that each scraped URL is routed through.
const formProxyUrl = (urlToScrape, apiKey) =>
  `https://someproxyservice/?key=${apiKey}&url=${urlToScrape}`;

scraper(domains, formProxyUrl);
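
The shape of the ./domains module isn't documented here, so the sketch below is only a guess based on the domain and kind columns used in the input CSV; check the package source for the structure it actually expects.

// domains.js (hypothetical example): map each domain/kind pair from the
// input CSV to a function that builds the URL to scrape for a given id.
module.exports = {
  wikipedia: {
    homepage: () => 'https://en.wikipedia.org/',
    wiki: (id) => `https://en.wikipedia.org/wiki/${id}`,
  },
};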

Inputs

The input CSV file must have domain, kind, and id headers. Make sure the headers are present as the first line of the file:

domain,kind,id
wikipedia,homepage,''
wikipedia,wiki,Ever_Given
wikipedia,wiki,COVID-19_pandemic

Optionally, you may also provide originDomain, originKind, originId, and context columns.

You can use the filter option below to filter on specific contexts.

When you generate parse session output, you will often see an inputs.csv file in the folder. This file contains inputs discovered while parsing the data you ran it over. It is common to use this new inputs.csv as the input for a fresh scrape and parse session.
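
For example, a row carrying the optional origin and context columns (whether written by hand or emitted into inputs.csv) might look like this; the values are illustrative only:

domain,kind,id,originDomain,originKind,originId,context
wikipedia,wiki,Suez_Canal,wikipedia,wiki,Ever_Given,wiki link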

Scraping

Data is scraped from a domain and saved in a local HTML cache folder. Data in this folder can then be parsed with the parse command.

Args

You must specify the token and input command-line arguments.

  • token: the ScrapingBee token
  • input: the input CSV file. A url header must exist in this file.

You may specify the from, to, filter, and parallelize command-line arguments.

  • from: start at this row number in the input CSV. Ex: --from=10
  • to: end at this row number in the input CSV. Ex: --to=90
  • filter: filter by a domain kind's context. Ex: --filter="wiki link". Filters can use the union operator, for example --filter="wiki link|reference link"
  • parallelize: run the scraper in tmux in N parallel sessions, each in a window named after the chunk of rows it is processing (see the sketch after this list). Ex: --parallelize=3 will run 3 sessions; if there are 9 rows, the windows will be named 0-2, 3-5, and 6-8.
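
The package's internals aren't shown here, but the window naming above follows a simple contiguous-chunking scheme; the snippet below is only an illustrative sketch of that scheme, not the package's actual code.

// Rough sketch of how --parallelize=N could split rows into contiguous
// chunks, matching the window names described above.
const chunkRows = (rowCount, parallelize) => {
  const size = Math.ceil(rowCount / parallelize);
  const chunks = [];
  for (let start = 0; start < rowCount; start += size) {
    const end = Math.min(start + size, rowCount) - 1;
    chunks.push({ from: start, to: end, window: `${start}-${end}` });
  }
  return chunks;
};

console.log(chunkRows(9, 3)); // windows "0-2", "3-5", "6-8"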

Example

node run.js scrape --token=MYTOKEN --input=input.csv --to=5

Parsing

Args

You must specify the input and session command-line arguments.

  • input: the input CSV file. A url header must exist in this file.
  • session: the output session name. Output files will be written to scraped/output/<session name>/*

You may specify the from, to, and filter command-line arguments.

  • from: start at this row number in the input CSV. Ex: --from=10
  • to: end at this row number in the input CSV. Ex: --to=90
  • filter: filter by a domain kind's context. Ex: --filter="wiki link". Filters can use the union operator, for example --filter="wiki link|reference link"

Examples

Parse the first 500 rows of a CSV

node run.js parse --input=input.csv --session=wiki_data_apr_1 --to=500

Parse the first 10 rows, filtering for rows whose context is "reference link"

node run.js parse --input=scraped/output/out2/inputs.csv --session=out3 --to=10 --filter="reference link"

Scrape rows 20 to 100 across 10 sessions in parallel

node run.js scrape --token=TKM0D87U1XR98JD0F74RMULE7GMLYDVY2O --input=input.csv --from=20 --to=100 --parallelize=10

Scrape everything in the input CSV across 10 sessions in parallel

node run.js scrape --token=TKM0D87U1XR98JD0F74RMULE7GMLYDVY2O --input=input.csv --parallelize=10