# Very Simple Scraper

## Overview
This repo contains a scraper and a parser that both work on a specified input CSV file. First you `scrape` the rows in the file. Then you `parse` the scraped response associated with each row. The `parse` command generates a parse session folder containing files for extracted data, missing data, and new inputs.
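As a rough sketch of that folder, the layout might look something like the following; only `inputs.csv` and the `scraped/output/<session name>/` path are documented below, and the other file names are hypothetical placeholders:

```
scraped/output/<session name>/
  extracted.csv   # hypothetical name: data extracted from the scraped responses
  missing.csv     # hypothetical name: rows where expected data could not be found
  inputs.csv      # new inputs discovered while parsing (see Inputs below)
```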
## Initial setup
```
npm i very-simple-scraper
```

```js
const scraper = require('very-simple-scraper').scraper;
const domains = require("./domains");

const formProxyUrl = (urlToScrape, apiKey) =>
  `https://someproxyservice/?key=${apiKey}&url=${urlToScrape}`;

scraper(domains, formProxyUrl);
```
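The shape of the `domains` module is not documented in this README. As a minimal sketch, assuming it simply maps each `domain` value from the input CSV to URL builders per `kind`, it might look like this (purely an assumption, not the package's documented API):

```js
// domains.js — hypothetical shape, shown only for illustration.
// Assumption: each domain maps its kinds to a function that builds the URL
// to scrape for a given id from the input CSV.
module.exports = {
  wikipedia: {
    homepage: () => 'https://en.wikipedia.org/',
    wiki: (id) => `https://en.wikipedia.org/wiki/${id}`,
  },
};
```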
## Inputs
The input CSV file requires `domain`, `kind`, and `id` headers. Make sure the headers are present as the first line of the file:
```
domain,kind,id
wikipedia,homepage,''
wikipedia,wiki,Ever_Given
wikipedia,wiki,COVID-19_pandemic
```
Optionally, you may also provide `originDomain`, `originKind`, `originId`, and `context` headers. You can use the `filter` option described below to filter on specific contexts.
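For example, an input file using the optional columns might look like the following; the `context` values are just illustrative tags for `--filter` to match against:

```
domain,kind,id,originDomain,originKind,originId,context
wikipedia,wiki,Suez_Canal,wikipedia,wiki,Ever_Given,wiki link
wikipedia,wiki,Evergreen_Marine,wikipedia,wiki,Ever_Given,reference link
```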
When you generate parse session output, you will often see an `inputs.csv` file in the session folder. This file contains new inputs found while parsing the data you ran it over. It is common to use this new `inputs.csv` as the input for a brand new `scrape` and `parse` session.
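For example, a second round can point directly at the previous session's `inputs.csv` (the session names here are placeholders):

```
node run.js scrape --token=MYTOKEN --input=input.csv
node run.js parse --input=input.csv --session=round1
node run.js scrape --token=MYTOKEN --input=scraped/output/round1/inputs.csv
node run.js parse --input=scraped/output/round1/inputs.csv --session=round2
```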
## Scraping

Data is scraped from a domain and saved in a local HTML cache folder. Data in this folder can then be parsed with the `parse` command.

### Args
You must specify the `token` and `input` command line arguments.

- `token`: the ScrapingBee token
- `input`: the input CSV file. A `url` header must exist in this file.
You may also specify the `from`, `to`, `filter`, and `parallelize` command line arguments.

- `from`: start at the `from` row number in the input CSV. Ex: `--from=10`
- `to`: end at the `to` row number in the input CSV. Ex: `--to=90`
- `filter`: filter by a domain kind's context. Ex: `--filter="wiki link"`. Filters can use the union operator. For example: `--filter="wiki link|reference link"`
- `parallelize`: run the scraper in tmux in N parallel sessions. Each session is a window named after the chunk it is processing. Ex: `--parallelize=3` will run 3 sessions. If there are 9 rows, the windows will be named `0-2`, `3-5`, and `6-8` (see the sketch after this list).
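For illustration only, here is a minimal sketch of how chunk ranges like those window names could be computed; the helper below is hypothetical and not the package's actual implementation:

```js
// chunkRanges is a hypothetical helper, shown only to illustrate how the
// parallelize window names (e.g. "0-2", "3-5", "6-8" for 9 rows and N=3)
// correspond to row-index chunks. It is not part of very-simple-scraper.
function chunkRanges(rowCount, parallelize) {
  const size = Math.ceil(rowCount / parallelize);
  const ranges = [];
  for (let start = 0; start < rowCount; start += size) {
    const end = Math.min(start + size, rowCount) - 1;
    ranges.push(`${start}-${end}`);
  }
  return ranges;
}

console.log(chunkRanges(9, 3)); // [ '0-2', '3-5', '6-8' ]
```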
### Example

```
node run.js scrape --token=MYTOKEN --input=input.csv --to=5
```
## Parsing

### Args
You must specify the `input` and `session` command line arguments.

- `input`: the input CSV file. A `url` header must exist in this file.
- `session`: the output session name. Output files will be written to `scraped/output/<session name>/*`
You may also specify the `from`, `to`, and `filter` command line arguments.

- `from`: start at the `from` row number in the input CSV. Ex: `--from=10`
- `to`: end at the `to` row number in the input CSV. Ex: `--to=90`
- `filter`: filter by a domain kind's context. Ex: `--filter="wiki link"`. Filters can use the union operator. For example: `--filter="wiki link|reference link"`
### Examples

Parse the first 500 rows of a CSV:

```
node run.js parse --input=input.csv --session=wiki_data_apr_1 --to=500
```
Parse the first 10 rows and filter for rows that have the context `reference link`:

```
node run.js parse --input=scraped/output/out2/inputs.csv --session=out3 --to=10 --filter="reference link"
```
Scrape rows 20 to 100 across 10 sessions in parallel:

```
node run.js scrape --token=TKM0D87U1XR98JD0F74RMULE7GMLYDVY2O --input=input.csv --from=20 --to=100 --parallelize=10
```

Scrape everything in the input CSV across 10 sessions in parallel:

```
node run.js scrape --token=TKM0D87U1XR98JD0F74RMULE7GMLYDVY2O --input=input.csv --parallelize=10
```