awesome-scraper
v0.1.5
website scraper
Scraper
A Node.js-based scraper that uses headless Chrome.
Installation
$ npm install @jonstuebe/scraper
Features
- Scrape top ecommerce sites (Amazon, Walmart, Target, BestBuy)
- Return basic product information (title, price, image, description)
- Easy-to-use API
API
Require the package and call it with a URL. The call returns a promise that resolves to the scraped data, so you can await it or chain .then().
CommonJS (require)
const Scraper = require("@jonstuebe/scraper");

// run inside of an async function
(async () => {
  const data = await Scraper.scrapeAndDetect("http://www.amazon.com/gp/product/B00X4WHP5E/");
  console.log(data);
})();
ES modules (import)
import Scraper from "@jonstuebe/scraper";

// run inside of an async function
(async () => {
  // uses the same scrapeAndDetect method as the require-based example
  const data = await Scraper.scrapeAndDetect("http://www.amazon.com/gp/product/B00X4WHP5E/");
  console.log(data);
})();
with promises
import Scraper from "@jonstuebe/scraper";

// scrapeAndDetect returns a promise, so .then() works without async/await
Scraper.scrapeAndDetect("http://www.amazon.com/gp/product/B00X4WHP5E/").then(data => {
  console.log(data);
});
custom scrapers
const Scraper = require("@jonstuebe/scraper");

(async () => {
  // a site definition: a name, the hosts it applies to, and a scrape function
  const site = {
    name: "npm",
    hosts: ["www.npmjs.com"],
    // receives the page and returns the fields to extract
    scrape: async page => {
      const name = await Scraper.getText("div.content-column > h1 > a", page);
      const version = await Scraper.getText(
        "div.sidebar > ul:nth-child(2) > li:nth-child(2) > strong",
        page
      );
      const author = await Scraper.getText(
        "div.sidebar > ul:nth-child(2) > li.last-publisher > a > span",
        page
      );

      return {
        name,
        version,
        author
      };
    }
  };

  // pass the custom site definition as the second argument
  const data = await Scraper.scrape(
    "https://www.npmjs.com/package/lodash",
    site
  );
  console.log(data);
})();
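The hosts array in the site definition suggests that a scraper is matched to a URL by its hostname. Here is a minimal sketch of how such matching could work. The helper findSiteForUrl and the sites list are hypothetical names used for illustration only; they are not part of this package's documented API, and the package's real detection logic may differ.

```javascript
// Hypothetical helper, not part of @jonstuebe/scraper:
// returns the first site whose hosts list contains the URL's hostname, or null.
function findSiteForUrl(url, sites) {
  const { hostname } = new URL(url);
  return sites.find(site => site.hosts.includes(hostname)) || null;
}

// Illustrative site list; only the "npm" entry comes from the example above.
const sites = [
  { name: "npm", hosts: ["www.npmjs.com"] },
  { name: "example", hosts: ["example.com", "www.example.com"] }
];

const match = findSiteForUrl("https://www.npmjs.com/package/lodash", sites);
console.log(match ? match.name : "no matching site"); // prints "npm"
```

Matching on the exact hostname means "npmjs.com" and "www.npmjs.com" are treated as different hosts, so a site definition should probably list every variant it wants to handle.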
Todos
- Add a test that detects when a store's markup has changed and, if it has, disables the store-specific selectors and falls back to the generic scraper.
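One way to approach this todo is to treat a missing or empty field as a sign that the store's markup has changed. The sketch below is only an illustration of that idea, not code from this package: hasRequiredFields and scrapeWithFallback are hypothetical names, and the check assumes every product field comes back as a string.

```javascript
// Fields listed under Features as the basic product information.
const REQUIRED_FIELDS = ["title", "price", "image", "description"];

// Hypothetical check: a result is usable only if every required field
// is a non-empty string (assumes all fields are returned as strings).
function hasRequiredFields(data, fields = REQUIRED_FIELDS) {
  if (!data) return false;
  return fields.every(
    field => typeof data[field] === "string" && data[field].trim() !== ""
  );
}

// Hypothetical wrapper: try the store-specific scraper first and use the
// generic scraper when the result is incomplete or the scraper throws.
async function scrapeWithFallback(storeScrape, genericScrape) {
  try {
    const data = await storeScrape();
    if (hasRequiredFields(data)) return data;
  } catch (err) {
    // A selector that no longer matches may throw; treat it as an incomplete result.
  }
  return genericScrape();
}
```

Here storeScrape and genericScrape are zero-argument functions that stand in for real calls, for example () => Scraper.scrape(url, site) for the store-specific case.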
Contributing
If you want to add any sites, or just have an idea or feature, go ahead and fork this repo and send me a pull request. I'll be happy to take a look when I can and get back to you.
Issues
For any issues or bugs, please post a description and a code sample that reproduces the problem on the issues page.