selector-scraper

v1.0.13

Published

16 days ago

A JavaScript library for scraping data based on CSS selectors.

Downloads

255

Vulnerabilities: 0 High, 0 Medium, 0 Low

jarvisnexus

Selector Scraper

A flexible and generic web scraper for extracting data from websites using a selector-based approach. This package allows you to configure and customize your scraping logic with ease.

Installation

To install the package, use npm:

npm install selector-scraper

Usage

Import the Scraper

In your JavaScript file, import the SelectorScraper class:

import SelectorScraper from 'selector-scraper';

Configure the Scraper

Create an instance of the scraper with the necessary configuration. Here’s an example configuration:

const scraper = new SelectorScraper({
    baseUrl: "https://example.com",
    path: "/category/featured/",
    page: {
        maxRecords: 2,
        traverse: {
            selector: "#main li a", // Required
            attribute: "href",
            filter: undefined, // Optional (Regex to filter links)
            prependBaseUrl: false // Optional (Default: false)
        },
        data: {},
        page: {
            maxRecords: 2,
            traverse: undefined,
            data: {
                title: {
                    selector: "div.title > h2", // Required
                    attribute: null, // Optional
                    transform: uppercase // Optional (callback; uppercase is defined in the Example section)
                },
                description: {
                    selector: "p.description",
                    attribute: null,
                    transform: null //Optional
                },
                link: {
                    selector: "div.entry-content a.mv_button_css",
                    attribute: "href",
                    transform: null
                },
                poster: {
                    selector: "img.poster",
                    attribute: "src",
                    transform: null
                }
            },
        }
    }
});
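
The transform field in the configuration above takes a callback, and uppercase is not defined in that snippet. A minimal sketch of a few such callbacks follows. It assumes the callback receives the extracted value and returns the transformed value, as the uppercase helper in the Example section suggests. The names trimText, toNumber, and pipe are hypothetical and not part of the package.

```javascript
// Hypothetical transform callbacks; the names are illustrative only.
// Assumption: each callback receives the extracted value and returns the new value.

// Upper-case the value (same helper as in the Example section).
const uppercase = value => value.toString().toUpperCase();

// Collapse runs of whitespace into single spaces and trim both ends.
const trimText = value => value.toString().replace(/\s+/g, ' ').trim();

// Parse a numeric string such as "1,234.5" into a number.
const toNumber = value => Number(value.toString().replace(/,/g, ''));

// Compose several callbacks into one, applied left to right.
const pipe = (...fns) => value => fns.reduce((acc, fn) => fn(acc), value);

const cleanTitle = pipe(trimText, uppercase);
console.log(cleanTitle('  avengers:\n endgame ')); // AVENGERS: ENDGAME
console.log(toNumber('1,234.5')); // 1234.5
```

A composed callback such as cleanTitle could then be passed as the transform value of a data field.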

Details

baseUrl: The base URL of the website to scrape.
path: The path to the landing page or index page to start scraping.
page.traverse: An object used to traverse through the subpages. If traverse is null or undefined, subpages will not be parsed.
- page.traverse.selector: CSS selector to extract item links for traversal.
- page.traverse.attribute: Name of the attribute (e.g., href) on the matched element that holds the link to traverse.
- page.traverse.filter: Regex to skip links during traversal. Applies to the inner text of the element.
- page.traverse.prependBaseUrl: A flag that tells the scraper whether to prepend the base URL to each traversed link. Optional; defaults to false.
page.data: An object where keys are data field names and values are objects describing how to extract each field from the page.
- page.data.selector: CSS selector to extract the element's value from the page. Required for each field.
- page.data.attribute: Name of the attribute (e.g., src) whose value is extracted from the matched element. Optional.
- page.data.transform: A callback function to transform the selector's value. Useful for applying conversions or string operations.
page.maxRecords: Maximum number of records to fetch from each page. Applies only to the traverse selector.
page.page: Configuration for nested pages. Scraped recursively.
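
To make the traverse options more concrete, here is a rough sketch of how a list of extracted links might be filtered, limited, and resolved. It is not the package's implementation. selectLinks is a hypothetical helper, the input objects are made up, and the sketch assumes the filter regex keeps links whose inner text matches (the descriptions above do not settle whether matches are kept or skipped). Base URL handling uses the standard URL constructor.

```javascript
// Hypothetical helper, not part of selector-scraper.
// links: array of { text, href } objects, as if already extracted by the traverse selector.
function selectLinks(links, { baseUrl, filter, prependBaseUrl = false, maxRecords = Infinity }) {
    return links
        // Assumption: keep links whose inner text matches the regex.
        .filter(link => !filter || filter.test(link.text))
        // maxRecords caps how many links are followed.
        .slice(0, maxRecords)
        // Resolve hrefs against baseUrl only when prependBaseUrl is set.
        .map(link => (prependBaseUrl ? new URL(link.href, baseUrl).href : link.href));
}

const links = [
    { text: 'Featured item', href: '/category/featured/item-1' },
    { text: 'About us', href: '/about' },
    { text: 'Featured bundle', href: '/category/featured/item-2' }
];

const selected = selectLinks(links, {
    baseUrl: 'https://example.com',
    filter: /^Featured/,
    prependBaseUrl: true,
    maxRecords: 1
});
console.log(selected); // [ 'https://example.com/category/featured/item-1' ]
```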

Run the Scraper

Call the scrape method to start scraping:

scraper.scrape()
  .then(results => {
    console.log('Scraping Results:', results);
  })
  .catch(error => {
    console.error('Scraping Error:', error);
  });

Example

Here is a complete example demonstrating how to use the scraper:

import SelectorScraper from 'selector-scraper';

const uppercase = value => value.toString().toUpperCase();

const config = {
    baseUrl: "https://www.imdb.com",
    path: "/title/tt4154796/",
    page: {
        maxRecords: 1,
        traverse: {
            selector: "#__next > main > div > section section.ipc-page-section.ipc-page-section--base.title-cast.title-cast--movie > div.ipc-title.ipc-title--base.ipc-title--section-title.ipc-title--on-textPrimary > div > a",
            attribute: "href",
            prependBaseUrl: true
        },
        data: {
            title: {
                selector: "h1 span",
                transform: uppercase
            }
        },
        page: {
            data: {
                cast: {
                    selector: "#fullcredits_content > table.cast_list > tbody > tr:nth-child(-n+10) > td:nth-child(2) > a",
                }
            },
        }
    }
};

const scraper = new SelectorScraper(config);

(async () => {
    const result = await scraper.scrape();
    console.log(JSON.stringify(result));
})();

Result

{"data":{"title":"AVENGERS: ENDGAME"},"child":[{"data":{"cast":["Robert Downey Jr.","Chris Evans","Mark Ruffalo","Chris Hemsworth","Scarlett Johansson","Jeremy Renner","Don Cheadle","Paul Rudd","Benedict Cumberbatch"]}}]}
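
The result nests each level's fields under data and the results of traversed subpages under child. As a small illustration of consuming that shape, the sketch below flattens it into a list of records tagged with their nesting depth. collectData is a hypothetical helper, and the { data, child } shape is inferred only from the sample output above.

```javascript
// Hypothetical helper; the { data, child } shape is inferred from the sample output.
function collectData(node, depth = 0, out = []) {
    // Record this level's fields together with its nesting depth.
    if (node.data) out.push({ depth, ...node.data });
    // Recurse into results from traversed subpages, if any.
    for (const child of node.child || []) collectData(child, depth + 1, out);
    return out;
}

const result = {
    data: { title: 'AVENGERS: ENDGAME' },
    // Cast list shortened here for brevity.
    child: [{ data: { cast: ['Robert Downey Jr.', 'Chris Evans'] } }]
};

const flat = collectData(result);
console.log(flat.length); // 2
console.log(flat[1].cast[0]); // Robert Downey Jr.
```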
