selector-scraper
v1.0.13
A JavaScript library for scraping data based on CSS selectors.
Selector Scraper
A flexible and generic web scraper for extracting data from websites using a selector-based approach. This package allows you to configure and customize your scraping logic with ease.
Installation
To install the package, use npm:
npm install selector-scraper
Usage
Import the Scraper
- In your JavaScript file, import the SelectorScraper class:
import SelectorScraper from 'selector-scraper';
Configure the Scraper
- Create an instance of the scraper with the necessary configuration. Here’s an example configuration:
const scraper = new SelectorScraper({
  baseUrl: "https://example.com",
  path: "/category/featured/",
  page: {
    maxRecords: 2,
    traverse: {
      selector: "#main li a", // Required
      attribute: "href",
      filter: undefined, // Optional (Regex to filter links)
      prependBaseUrl: false // Optional (Default: false)
    },
    data: {},
    page: {
      maxRecords: 2,
      traverse: undefined,
      data: {
        title: {
          selector: "div.title > h2", // Required
          attribute: null, // Optional
          transform: uppercase // Optional (user-defined callback, e.g. value => value.toString().toUpperCase())
        },
        description: {
          selector: "p.description",
          attribute: null,
          transform: null // Optional
        },
        link: {
          selector: "div.entry-content a.mv_button_css",
          attribute: "href",
          transform: null
        },
        poster: {
          selector: "img.poster",
          attribute: "src",
          transform: null
        }
      }
    }
  }
});
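Because the configuration nests `page` inside `page`, a missing required `selector` is easy to overlook. The sketch below is a hypothetical pre-flight check, not part of selector-scraper: the name `validatePage` and its messages are assumptions made for illustration, and it relies only on the field names shown in the configuration above.

```javascript
// Hypothetical helper, not part of selector-scraper: collects problems in a
// nested `page` configuration before it is handed to the scraper.
function validatePage(page, path = "page", problems = []) {
  if (page === null || typeof page !== "object") {
    problems.push(`${path} must be an object`);
    return problems;
  }
  // traverse may be null/undefined (no subpages), but needs a selector when set.
  if (page.traverse != null && typeof page.traverse.selector !== "string") {
    problems.push(`${path}.traverse.selector is required`);
  }
  // Every data field needs a selector; attribute and transform are optional.
  for (const [field, spec] of Object.entries(page.data ?? {})) {
    if (!spec || typeof spec.selector !== "string") {
      problems.push(`${path}.data.${field}.selector is required`);
    }
    if (spec && spec.transform != null && typeof spec.transform !== "function") {
      problems.push(`${path}.data.${field}.transform must be a function`);
    }
  }
  // Nested pages use the same shape, so recurse into them.
  if (page.page != null) {
    validatePage(page.page, `${path}.page`, problems);
  }
  return problems;
}

const problems = validatePage({
  maxRecords: 2,
  traverse: { attribute: "href" }, // selector left out on purpose
  data: {},
  page: { data: { title: { selector: "div.title > h2" }, poster: {} } }
});
console.log(problems);
```

Here the helper should report the missing `traverse.selector` and the missing selector for `poster`. An empty array means no problems were found by these particular checks.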
Details
- baseUrl: The base URL of the website to scrape.
- path: The path to the landing page or index page to start scraping.
- page.traverse: An object used to traverse through the subpages. If traverse is null or undefined, subpages will not be parsed.
- page.traverse.selector: CSS selector to extract item links for traversal.
- page.traverse.attribute: Name of the attribute (e.g., href) on the matched element from which the item link is read for traversal.
- page.traverse.filter: Regex to skip links during traversal. Applies to the inner text of the element.
- page.traverse.prependBaseUrl: A flag that tells the scraper whether to prepend the base URL to each traverse link.
- page.data: An object whose keys are data field names and whose values describe how to extract each field from item pages (selector, attribute, transform).
- page.data.selector: CSS selector to extract the element's value from the page.
- page.data.attribute: Name of the attribute (e.g., src) whose value is extracted from the matched element. Optional.
- page.data.transform: A callback function to transform the selector's value. Useful for applying conversions or string operations.
- page.maxRecords: Maximum number of records to fetch from each page. Applies only to the traverse selector.
- page.page: Configuration for nested pages. Scraped recursively.
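As an illustration of page.data.transform, here is a minimal sketch of reusable transform callbacks. The helper names (`each`, `pipe`, `trim`, `collapseSpaces`, `cleanTitle`) are hypothetical and not part of the package. It also assumes the callback may receive either a single string or an array of strings, which this README does not state explicitly.

```javascript
// Hypothetical helpers for page.data.<field>.transform callbacks.
// Assumption: the callback receives the extracted value, which may be a
// single string or an array of strings when several elements match.

// Apply a string function to one value, or to each value in an array.
const each = fn => value =>
  Array.isArray(value) ? value.map(v => fn(String(v))) : fn(String(value));

// Run several transforms left to right.
const pipe = (...fns) => value => fns.reduce((acc, fn) => fn(acc), value);

const trim = each(s => s.trim());
const collapseSpaces = each(s => s.replace(/\s+/g, " "));
const uppercase = each(s => s.toUpperCase());

// A composed callback that could be passed as `transform` for a title field.
const cleanTitle = pipe(trim, collapseSpaces, uppercase);

console.log(cleanTitle("  Avengers:\n  Endgame ")); // AVENGERS: ENDGAME
console.log(trim([" Chris Evans ", "Mark Ruffalo "]));
```

Keeping each step small makes it easy to reuse the same callbacks across fields, for example `transform: cleanTitle` for a title and `transform: trim` for a list of names.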
Run the Scraper
- Call the scrape method to start scraping:
scraper.scrape()
  .then(results => {
    console.log('Scraping Results:', results);
  })
  .catch(error => {
    console.error('Scraping Error:', error);
  });
Example
Here is a complete example demonstrating how to use the scraper:
import SelectorScraper from 'selector-scraper';

// Transform callback: converts the extracted value to upper case.
const uppercase = value => value.toString().toUpperCase();

const config = {
  baseUrl: "https://www.imdb.com",
  path: "/title/tt4154796/",
  page: {
    maxRecords: 1,
    traverse: {
      selector: "#__next > main > div > section section.ipc-page-section.ipc-page-section--base.title-cast.title-cast--movie > div.ipc-title.ipc-title--base.ipc-title--section-title.ipc-title--on-textPrimary > div > a",
      attribute: "href",
      prependBaseUrl: true
    },
    data: {
      title: {
        selector: "h1 span",
        transform: uppercase
      }
    },
    page: {
      data: {
        cast: {
          selector: "#fullcredits_content > table.cast_list > tbody > tr:nth-child(-n+10) > td:nth-child(2) > a"
        }
      }
    }
  }
};

const scraper = new SelectorScraper(config);

(async () => {
  const result = await scraper.scrape();
  console.log(JSON.stringify(result));
})();
Result
{"data":{"title":"AVENGERS: ENDGAME"},"child":[{"data":{"cast":["Robert Downey Jr.","Chris Evans","Mark Ruffalo","Chris Hemsworth","Scarlett Johansson","Jeremy Renner","Don Cheadle","Paul Rudd","Benedict Cumberbatch"]}}]}
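The result nests each subpage under `child`, so reading it usually means walking a small tree. The sketch below shows one way to collect every `data` object together with its depth. `collectData` is a hypothetical helper, and it assumes every node follows the `{ data, child }` shape of the sample result above.

```javascript
// Hypothetical helper: walk a result of the shape shown above and collect
// every `data` object together with its nesting depth.
// Assumption: each node has a `data` object and an optional `child` array.
function collectData(node, depth = 0, out = []) {
  if (node && typeof node === "object") {
    if (node.data) out.push({ depth, data: node.data });
    const children = Array.isArray(node.child) ? node.child : [];
    for (const child of children) collectData(child, depth + 1, out);
  }
  return out;
}

// Shortened copy of the sample result, used here instead of a live scrape.
const sample = {
  data: { title: "AVENGERS: ENDGAME" },
  child: [{ data: { cast: ["Robert Downey Jr.", "Chris Evans"] } }]
};

for (const { depth, data } of collectData(sample)) {
  console.log(depth, JSON.stringify(data));
}
```

With the sample above, depth 0 holds the title from the landing page and depth 1 holds the cast list from the traversed subpage.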