cheers2
v0.4.2
Scrape a website efficiently, block by block, page by page. Based on cheerio and cURL.
Cheers
Scrape a website efficiently, block by block, page by page.
Motivations
Cheers is a Cheerio-based scraper that extracts data from a website using CSS selectors. The motivation behind this package is to provide a simple scraping tool that divides a website into blocks and transforms each block into a JSON object using CSS selectors.
Built on top of the excellent:
- https://github.com/cheeriojs/cheerio
- https://github.com/chriso/curlrequest
- https://github.com/kriskowal/q
CSS mapping syntax inspired by:
https://github.com/dharmafly/noodle
Getting Started
Install the module with: npm install cheers
Usage
Configuration options:
- config.url: the URL to scrape.
- config.blockSelector: the CSS selector to apply on the page to divide it into scraping blocks. This field is optional (defaults to "body").
- config.scrape: the definition of what you want to extract in each block. Each key has two mandatory attributes: selector (a CSS selector, or "." to stay on the current node) and extract. The possible values for extract are text, html, outerHTML, a RegExp, or the name of an attribute of the HTML element (e.g. "href").
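As a rough illustration of the options above, here is a minimal sketch of a configuration object. The URL, the selectors and the field names (title, link, body, year) are placeholders, and validateConfig is a hypothetical helper written for this example, not something the package provides. Treating config.url as required is an assumption based on the description above. The call that consumes the config is left out because this section does not show it.

```javascript
// Example configuration using only the options documented above.
// The URL and the CSS selectors are placeholders, not a real site layout.
const config = {
  url: "http://example.com/news",
  blockSelector: "article", // optional; "body" is used when omitted
  scrape: {
    title: { selector: "h2 a", extract: "text" },
    link: { selector: "h2 a", extract: "href" },   // name of an HTML attribute
    body: { selector: ".", extract: "html" },      // "." stays on the current node
    year: { selector: ".date", extract: /\d{4}/ }, // a RegExp
  },
};

// Hypothetical helper (not part of the package): returns a list of problems
// found in a config, checking the mandatory attributes described above.
function validateConfig(cfg) {
  const problems = [];
  // Assumption: config.url is required, since only blockSelector is marked optional.
  if (typeof cfg.url !== "string" || cfg.url === "") {
    problems.push("config.url is required");
  }
  if (cfg.blockSelector !== undefined && typeof cfg.blockSelector !== "string") {
    problems.push("config.blockSelector must be a string when provided");
  }
  const scrape = cfg.scrape || {};
  for (const key of Object.keys(scrape)) {
    const field = scrape[key] || {};
    if (typeof field.selector !== "string") {
      problems.push(`scrape.${key}.selector is required`);
    }
    const extractOk =
      typeof field.extract === "string" || field.extract instanceof RegExp;
    if (!extractOk) {
      problems.push(`scrape.${key}.extract must be a string or a RegExp`);
    }
  }
  return problems;
}

console.log(validateConfig(config));
```

Checking the config before scraping is only a convenience: a missing selector or extract is then reported per key instead of surfacing later as an empty result.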
Roadmap
- Option to use request instead of curl
- Option to change the user agent
- Command line tool
- Website pagination
- Option to use a headless browser
- Unit tests
Contributors
- https://github.com/fallanic
- https://github.com/arsalan-k
- https://github.com/kchapelier
Cheers!
License
Copyright (c) 2014 Fabien Allanic
Licensed under the MIT license.