# Scraper

A basic website scraper.

## Usage

A simple example:
```js
const Scraper = require('@riteable/scraper')

async function run () {
  const scraper = new Scraper()

  scraper
    .setIndexUrl('https://example.com')
    .setLinkSelector('.article .title a')

  return scraper.fetchPages()
}

run()
  .then(console.log)
  .catch(console.error)
```
The above example would output something like the following:
```js
[
  {
    title: 'Some article',
    description: 'A description of the article.',
    image: 'https://example.com/path/to/an/image.jpg',
    url: 'https://example.com/some-article'
  }
]
```
An instance of `Scraper` will try to extract the above data by default. If you need to extract more data, or don't need the above, you can use the `setDataMap()` method to specify what you need:
```js
scraper.setDataMap({
  ...scraper.helpers,
  publishedAt: ({ $ }) => $('meta[property="article:published_time"]').attr('content')
})
```
The helpers have certain fallbacks built in when looking for data. See `helpers.js` for the implementations.
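As a purely hypothetical illustration of that fallback pattern (the actual selectors live in `helpers.js` and may differ), a title helper might try an Open Graph tag before falling back to the document title:

```js
// Illustrative only: a fallback-style helper in the shape setDataMap() expects.
// The real title() helper in helpers.js may use different selectors.
const title = ({ $ }) =>
  $('meta[property="og:title"]').attr('content') || // prefer the Open Graph title
  $('title').text().trim()                          // otherwise fall back to the <title> tag
```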
## API
The following properties and methods are available:
`helpers`
: This property contains helper functions to extract commonly needed data. Currently implemented:
  - `title()`
  - `description()`
  - `image()`
  - `url()`
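Since the helpers are ordinary callbacks, a minimal sketch of reusing individual ones in a custom data map could look like this:

```js
// Reuse just the title and url helpers instead of spreading all of scraper.helpers.
scraper.setDataMap({
  title: scraper.helpers.title,
  url: scraper.helpers.url
})
```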
`setIndexUrl(url)`
: Set the URL of a page which contains a list of articles/pages that you want to scrape.
`setLinkSelector(selector)`
: Set the selector of the `<a>` elements which link to the pages to be scraped. This module uses cheerio for parsing and traversing documents.
`setDataMap(object)`
: Determine how data should be extracted and mapped to fields. The object only accepts callback functions as values. Each callback receives an object parameter containing the document parsed by cheerio, aliased as `$`, so you can easily query data within the document. The rest of the parameter object contains the needle response for the requested page.
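For instance, a data map that skips the default helpers and extracts only custom fields might look like the following sketch (the selectors here are illustrative assumptions, not part of the module):

```js
scraper.setDataMap({
  // Each value is a callback that receives the parsed document as `$`.
  headline: ({ $ }) => $('h1').first().text().trim(),
  author: ({ $ }) => $('meta[name="author"]').attr('content'),
  linkCount: ({ $ }) => $('a').length // number of links on the page
})
```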
`setThrottle(object)`
: You can throttle requests with a `delay` and a `concurrent` setting. For example:
```js
scraper.setThrottle({
  delay: 500, // milliseconds between requests
  concurrent: 1 // number of requests at a time
})
```
`async fetchIndex()`
: Parse data only from the index URL, determined by `setIndexUrl()`.
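A minimal sketch of using it on its own (the shape of the returned data depends on your data map):

```js
scraper.setIndexUrl('https://example.com')

scraper.fetchIndex()
  .then(console.log) // data scraped from the index page only
  .catch(console.error)
```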
`async fetchPages()`
: Extract data from the linked pages, found via the selector set by `setLinkSelector()`.
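Putting the pieces together, a fuller run that combines a custom data map, throttling and `fetchPages()` might look like this sketch (the URL and selectors are placeholders):

```js
const Scraper = require('@riteable/scraper')

async function run () {
  const scraper = new Scraper()

  scraper
    .setIndexUrl('https://example.com/blog') // placeholder index URL
    .setLinkSelector('.post-list a')         // placeholder link selector

  scraper.setThrottle({ delay: 1000, concurrent: 2 })

  scraper.setDataMap({
    ...scraper.helpers, // keep the default title/description/image/url fields
    publishedAt: ({ $ }) => $('meta[property="article:published_time"]').attr('content')
  })

  return scraper.fetchPages()
}

run()
  .then(console.log)
  .catch(console.error)
```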