paginated-listings-scraper (v2.3.3)

DISCLAIMER

This documentation is for version 1. Version 2 has changed a lot and I'm afraid I haven't been able to update the documentation with the new features and changes, so you are best off looking at the code. It should mostly work as before, though, hopefully. This is mainly a project for my own personal use. If you would like better documentation or encounter any errors, please file an issue and I'll do my best to help you out.

Paginated Listings Scraper

Extract listings data from paginated web pages.

It uses Cheerio to access the DOM.

If you are using Chrome you can get an accurate CSS selector for a given element quite easily. See this Stack Overflow answer.

For debugging, set the DEBUG=paginated-listings-scraper environment variable.
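
For example, assuming your entry point is a file called scrape.js (an illustrative name, not part of the package):

DEBUG=paginated-listings-scraper node scrape.js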

Installation

npm i paginated-listings-scraper

Example usage

  import { scrape } from 'paginated-listings-scraper';

  const options = {
    dataSelector: {
      text: '.text-block',
      title: 'h3',
    },
    filter: '.row.blank',
    maximumDepth: 3,
    nextPageSelector: 'a.next-page',
    parentSelector: '.row',
    terminate: (element, $) => element.find($('.bad-apple')).length,
    url: 'http://paginatedlistings.com',
  };

  // scrape returns a promise that resolves to an array of extracted items
  const data = await scrape(options);
  // data = [{ title: 'Old McDonald', text: 'Had a farm' }, ...]

Options

url

The URL of the page you wish to scrape. Ideally this should be a paginated page consisting of elements in a list format. It uses request-native-promise to fetch the page. See request.

parentSelector

The CSS selector of the elements you wish to iterate over. Each element matching this selector will be mapped using dataSelector to extract the specified data. See cheerio selectors, cheerio find and cheerio map.

dataSelector

Used to extract data from the elements returned by parentSelector. It can be either a function or an object of keys in the form { name: cssSelector }. cssSelector can be a string or a function.

If an object is used, it will iterate over each of its keys and extract the text contained within the element returned by the CSS selector. Each item is returned as an object in the form { name: data }, as sketched below.
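
For example, a minimal sketch of what the object form does, using cheerio directly (the markup here is made up for illustration):

  import * as cheerio from 'cheerio';

  // hypothetical listing markup: each '.row' holds an 'h3' title and a '.text-block'
  const $ = cheerio.load(`
    <div class="row"><h3>Old McDonald</h3><p class="text-block">Had a farm</p></div>
    <div class="row"><h3>Mary</h3><p class="text-block">Had a little lamb</p></div>
  `);

  // roughly what parentSelector: '.row' with dataSelector: { title: 'h3', text: '.text-block' } yields
  const items = $('.row')
    .map((i, el) => ({
      title: $(el).find('h3').text(),
      text: $(el).find('.text-block').text(),
    }))
    .get();
  // items = [{ title: 'Old McDonald', text: 'Had a farm' }, { title: 'Mary', text: 'Had a little lamb' }]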

If a function is used, it will receive the element currently being acted on as a cheerio element, as well as the cheerio function created from the DOM, as arguments. This allows you to select whatever data you need.

  // a dataSelector function receives the current cheerio element and the loaded cheerio instance
  dataSelector(element, $) {
    return element.find($('#sweet.sweet.data')).text();
  }

See cheerio selectors and cheerio find.

The value returned from this will be added to an array, which is eventually returned by the scraper.

nextPageSelector

Gets the URL of the next page to be scraped. Can be either a CSS selector or a function. If a selector is used, it gets the href property of the matched element. If the href is not a valid URL, it is assumed to be a path and is concatenated with the origin of the URL that was initially passed in as the url option (see the sketch below).
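
A minimal sketch of that resolution rule, for illustration only (this is not the library's actual code):

  // if href parses as an absolute URL use it as-is; otherwise treat it as a path on the starting origin
  function resolveNextHref(href, startUrl) {
    try {
      return new URL(href).toString();
    } catch {
      return `${new URL(startUrl).origin}${href}`;
    }
  }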

If you need something more custom than this, use a function. The function receives an object containing the loaded cheerio instance, the original URL and the current depth, which allows you to select whatever you want from the page.

  nextPageSelector({ $, url, depth }) {
    // derive the origin from the starting url, then follow a non-standard attribute
    const { origin } = new URL(url);
    return `${origin}${$('a.hard-to-get').attr('data-hidden-href')}`;
  }

This function should return a URL, which will be used to request the next page to be scraped. See cheerio selectors and cheerio find.

maximumDepth (optional if terminate function is provided)

The page number at which the scraper will stop. If set to 0, no pages will be scraped. Must be a number.

terminate (optional if maximumDepth is provided)

A function that is run to determine whether or not to stop scraping. It is run against each element returned by the parentSelector. It receives the element currently being acted on as a cheerio element, as well as the cheerio function created from the DOM, as arguments.

  // stop scraping as soon as an element carries this (hypothetical) attribute
  terminate(element, $) {
    return !!element.attr('data-important-confidential-stuff');
  }

Must return something truthy or falsy. See cheerio selectors.

filter (optional)

Can be either a CSS selector or a function. It is used to filter out unwanted elements before the initial iteration takes place. See cheerio filter for an explanation and example usage.

shouldReturnDataOnError (optional, default = false)

States whether or not the scraper should return the data it has collected so far when it encounters an error while scraping a page. This means no error will be propagated, so be careful.
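
For instance, a short sketch of opting in, reusing the options object from the example above:

  // resolves with whatever was collected before the failure instead of rejecting
  const partial = await scrape({ ...options, shouldReturnDataOnError: true });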