yolo-scraper
v1.0.1
A simple way to structure your web scraper.
- Define the request.
- Extract the data from the response.
- Validate the data against JSON Schema.
install
Using NPM:
npm install yolo-scraper --save
usage
Define your scraper function.
var yoloScraper = require('yolo-scraper');
var scraper = yoloScraper.createScraper({
  request: function (username) {
    return 'https://www.npmjs.com/~' + username.toLowerCase();
  },
  extract: function (response, body, $) {
    return $('.collaborated-packages li').toArray().map(function (element) {
      var $element = $(element);
      return {
        name: $element.find('a').text(),
        url: $element.find('a').attr('href'),
        version: $element.find('strong').text()
      };
    });
  },
  schema: {
    "$schema": "http://json-schema.org/draft-04/schema#",
    "type": "array",
    "items": {
      "type": "object",
      "additionalProperties": false,
      "properties": {
        "name": { "type": "string" },
        "url": { "type": "string", "format": "uri" },
        "version": { "type": "string", "pattern": "^v\\d+\\.\\d+\\.\\d+$" }
      },
      "required": [ "name", "url", "version" ]
    }
  }
});
Then use it.
scraper('masterT')
  .then(function (data) {
    console.log(data);
  })
  .catch(function (error) {
    console.error(error);
  });
documentation
ValidationError
An Error instance with an additional Object property, errorObjects, which contains all the error information; see the ajv error documentation.
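To tell validation failures apart from other errors (a failed network request, for example), one option is to inspect errorObjects in the catch handler. The sketch below uses a hypothetical helper, describeScraperError, which is not part of yolo-scraper. It assumes errorObjects is an array of ajv error objects, each with a message and either a dataPath or an instancePath depending on the ajv version.

```javascript
// Hypothetical helper (not part of yolo-scraper): turn a rejected scraper
// error into a readable string.
// Assumption: `errorObjects` is an array of ajv error objects.
function describeScraperError(error) {
  if (!Array.isArray(error.errorObjects)) {
    // Not a validation failure, e.g. a network error from the request.
    return 'Scraper failed: ' + error.message;
  }
  var details = error.errorObjects.map(function (errorObject) {
    // ajv reports the failing location as `dataPath` in older versions
    // and `instancePath` in newer ones; fall back to the root otherwise.
    var path = errorObject.dataPath || errorObject.instancePath || '(root)';
    return path + ' ' + errorObject.message;
  });
  return 'Invalid data: ' + details.join('; ');
}
```

It could be called from the scraper's catch handler, for example console.error(describeScraperError(error)).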
createScraper(options)
Returns a scraper function defined by the options.
var yoloScraper = require('yolo-scraper');
var options = {
  // ...
};
var scraper = yoloScraper.createScraper(options);
The scraper function returns a Promise that resolves with the valid extracted data or rejects with an Error.
scraper(params)
  .then(function (data) {
    console.log(data);
  })
  .catch(function (error) {
    console.error(error);
  });
options.paramsSchema
The JSON schema that defines the shape of the accepted arguments passed to options.request. When the arguments are invalid, an Error is thrown.
Optional
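As an illustration, a paramsSchema for the username scraper from the usage section might look like the sketch below. It assumes the schema is checked against the single value passed to the scraper function. The pattern is an illustrative choice, not a rule taken from npm.

```javascript
// Sketch of a paramsSchema for a scraper called as scraper('masterT').
// Assumption: the schema is checked against the single params value.
var paramsSchema = {
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "string",
  "minLength": 1,
  // Illustrative pattern: letters, digits, dots, dashes and underscores.
  "pattern": "^[A-Za-z0-9._-]+$"
};

// It would be passed alongside the other options:
// yoloScraper.createScraper({ paramsSchema: paramsSchema, /* request, extract, schema */ });
```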
options.request = function(params)
Function that takes the arguments passed to your scraper function and returns the options to pass to the axios module to make the network request.
Required
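The usage example returns a URL string from request. Since the return value is handed to axios, returning an axios request config object is presumably possible as well; the sketch below assumes so. The function name, header value, and timeout are illustrative.

```javascript
// Sketch of a request function returning an axios request config.
// Assumption: yolo-scraper passes the returned value to axios unchanged.
function buildProfileRequest(username) {
  return {
    method: 'get',
    // Encode the username so unexpected characters cannot alter the URL path.
    url: 'https://www.npmjs.com/~' + encodeURIComponent(username.toLowerCase()),
    headers: { 'Accept': 'text/html' },
    timeout: 10000 // milliseconds before axios aborts the request
  };
}

// It would be used as the `request` option:
// yoloScraper.createScraper({ request: buildProfileRequest, /* extract, schema */ });
```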
options.extract = function(response, body, $)
Function that takes the axios response, the response body (String) and a cheerio instance. It returns the extracted data you want.
Required
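extract does not have to rely on cheerio. When an endpoint returns JSON, the cheerio instance can be ignored and the body parsed directly, as in the sketch below. It assumes the body arrives as a String, as described above, and the payload shape ({ packages: [{ name, version }] }) is hypothetical.

```javascript
// Sketch of an extract function for a JSON endpoint.
// Assumption: the payload shape ({ packages: [...] }) is made up for illustration.
function extractPackages(response, body, $) {
  var payload = JSON.parse(body);
  // Keep only the fields the schema is expected to describe.
  return payload.packages.map(function (pkg) {
    return { name: pkg.name, version: pkg.version };
  });
}

// It would be used as the `extract` option:
// yoloScraper.createScraper({ extract: extractPackages, /* request, schema */ });
```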
options.schema
The JSON schema that defines the shape of your extracted data. When your data is invalid, the Promise returned by your scraper rejects with an Error carrying the validation message.
Required
options.cheerioOptions
The options to pass to cheerio when it loads the response body.
Optional, default: {}
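For instance, when the target is an XML document such as an RSS feed, cheerio's xmlMode option can be enabled. A minimal sketch of such an options object:

```javascript
// cheerio options for parsing XML (for example an RSS feed) instead of HTML.
var cheerioOptions = {
  xmlMode: true
};

// Passed with the other options:
// yoloScraper.createScraper({ cheerioOptions: cheerioOptions, /* request, extract, schema */ });
```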
options.ajvOptions
The options to pass to ajv when it compiles the JSON schemas.
Optional, default: {allErrors: true}
- It checks all rules, collecting all errors.
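If only the first validation failure matters, the default can be overridden. The sketch below sets ajv's allErrors option to false so that validation stops at the first failing rule.

```javascript
// ajv options that stop validation at the first error instead of collecting all of them.
var ajvOptions = {
  allErrors: false
};

// Passed with the other options:
// yoloScraper.createScraper({ ajvOptions: ajvOptions, /* request, extract, schema */ });
```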
dependencies
- axios - Promise based HTTP client for the browser and node.js.
- cheerio - Fast, flexible, and lean implementation of core jQuery designed specifically for the server.
- ajv - The fastest JSON Schema Validator. Supports draft-04/06/07.
dev dependencies
- jasmine - Simple JavaScript testing framework for browsers and node.js.
- nock - HTTP server mocking and expectations library for Node.js.
test
npm test
license
MIT