contract-scraper
v6.0.0
A customisable data scraper for the web based on JSON contracts
With contract-scraper you can easily scrape an HTML page and return the data in a structured format.
Installation
npm install contract-scraper --save
yarn add contract-scraper
Usage
To scrape a page, create a new instance of contract-scraper with these parameters:
import Scraper from 'contract-scraper';

const contract = {
  itemSelector: 'li',
  puppeteer: true,
  attributes: {
    name: {
      type: 'text',
      selector: '.name',
    },
    link: {
      type: 'link',
      selector: 'a',
      attribute: 'href',
    },
  },
};

const puppeteerOptions = {
  headless: false,
};

const scraper = new Scraper('http://website.com', contract, puppeteerOptions);
A scraper can be initialised with custom puppeteer launch options.
A contract accepts the following properties:
itemSelector
(string)
A CSS selector for the element to be scraped. The scraper will process all the elements matching this selector.
puppeteer
(boolean)
If set to true, contract-scraper will use Puppeteer to load and scrape the page contents.
waitForPageLoadSelector
(string)
Puppeteer will wait for this CSS selector to exist in the DOM before scraping the page. Must be used in conjunction with puppeteer: true.
attributes
(object)
Defines the data to scrape for each item.
Each attribute matches an HTML element to scrape. The attribute type defines how data will be extracted from the element, and how the data should be formatted in the final output. For example, you can use one of the in-built types to extract a number from an element:
<ul>
  <li>
    <div class="name">Iron man</div>
    <div class="price">100 euros</div>
  </li>
  <li>
    <div class="name">Captain America</div>
    <div class="price">500 euros</div>
  </li>
</ul>
const contract = {
  itemSelector: 'li',
  attributes: {
    name: {
      type: 'text',
      selector: '.name',
    },
    price: {
      type: 'number',
      selector: '.price',
    },
  },
};

const scraper = new Scraper('http://characters.com', contract);

scraper.scrapePage().then(items => {
  console.log(items);
  // [
  //   {
  //     name: 'Iron man',
  //     price: 100
  //   },
  //   {
  //     name: 'Captain America',
  //     price: 500
  //   }
  // ]
});
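The puppeteer and waitForPageLoadSelector properties described above are meant to be used together. As a rough sketch, the contract below targets a page whose list is rendered client-side, and a small pre-flight check catches the mismatch before any scraping starts. The `checkContract` helper and the `.results-loaded` selector are hypothetical and are not part of contract-scraper.

```javascript
// Contract for a page whose list is rendered client-side.
// '.results-loaded' is a made-up selector used for illustration only.
const contract = {
  itemSelector: 'li',
  puppeteer: true,
  waitForPageLoadSelector: '.results-loaded',
  attributes: {
    name: { type: 'text', selector: '.name' },
    price: { type: 'number', selector: '.price' },
  },
};

// Hypothetical helper (not part of contract-scraper): returns a list of
// problems found in a contract, based only on the rules described above.
function checkContract(candidate) {
  const problems = [];
  if (typeof candidate.itemSelector !== 'string' || candidate.itemSelector === '') {
    problems.push('itemSelector must be a non-empty string');
  }
  if (candidate.waitForPageLoadSelector !== undefined && candidate.puppeteer !== true) {
    problems.push('waitForPageLoadSelector requires puppeteer: true');
  }
  if (typeof candidate.attributes !== 'object' || candidate.attributes === null) {
    problems.push('attributes must be an object');
  }
  return problems;
}
```

Running `checkContract(contract)` on the contract above returns an empty array; dropping `puppeteer: true` while keeping `waitForPageLoadSelector` reports the mismatch.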
Each attribute can have the following properties:
name (string) - A label for this attribute for the final output.
selector (string) - The CSS selector for the element (scoped to itemSelector).
type (string) - A custom type, or one of the in-built ones that returns:
- background-image: A background-image URL from a style string
- link: An absolute URL
- number: A number
- size: A number for size in m²
- text: Inner text of the element
attribute (optional) (string) - The name of the HTML attribute to scrape data from. E.g. for an element:
<a href="http://linktoscrape">Homepage</a>
{ name: 'URL', type: 'link', selector: 'a', attribute: 'href' }
By default the attribute type will use the innerText of the element if attribute is not specified.
data (optional) (object) - If you want to scrape HTML data attributes you can do it in two ways:
- Directly scraping a data attribute:
<div data-country="Australia"></div>
{ name: 'Country', type: 'text', selector: 'data-country', data: { name: 'country' } }
This will return "Australia" in your list of results.
- For scraping a JSON value inside a data attribute:
<div data-price="{currency: 'aud'}"></div>
{ name: 'Price', type: 'number', selector: 'data-price', data: { name: 'price', key: 'currency'} }
This will return "aud" in your list of results.
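To make the in-built types more concrete, here is a minimal sketch of the kind of conversion that number, link and background-image describe. These functions are illustrative assumptions and are not contract-scraper's own implementation; in particular, `toNumber` assumes commas are thousands separators.

```javascript
// Illustrative conversions only; contract-scraper's own types may differ.

// 'number': keep the first numeric run in the text, e.g. "100 euros".
// Assumption: commas are thousands separators and can be dropped.
function toNumber(text) {
  const match = String(text).replace(/,/g, '').match(/-?\d+(\.\d+)?/);
  return match ? Number(match[0]) : null;
}

// 'link': resolve a possibly relative href against the page URL.
function toAbsoluteUrl(href, pageUrl) {
  return new URL(href, pageUrl).href;
}

// 'background-image': pull the URL out of an inline style string.
function toBackgroundImageUrl(style) {
  const pattern = /background-image\s*:\s*url\(\s*['"]?([^'")]+)['"]?\s*\)/i;
  const match = pattern.exec(style);
  return match ? match[1] : null;
}
```

For instance, `toNumber('100 euros')` evaluates to `100`, and `toAbsoluteUrl('/about', 'http://website.com/list')` evaluates to `'http://website.com/about'`.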
Nested attributes
It's also possible to scrape nested attributes, like a list inside an item:
<ul class="friends">
  <li>
    <span>Spiderman</span>
    <ul>
      <li><strong>Iron</strong><em>Man</em></li>
      <li><strong>Captain</strong><em>America</em></li>
    </ul>
  </li>
</ul>
The contract:
{
  "itemSelector": ".friends li",
  "attributes": {
    "name": { "type": "text", "selector": "span" },
    "friends": {
      "itemSelector": "ul li",
      "attributes": {
        "firstName": { "type": "text", "selector": "strong" },
        "lastName": { "type": "text", "selector": "em" }
      }
    }
  }
}
This returns the friends of each item as an array of objects (nested attributes can use any type):
[
  {
    name: 'Spiderman',
    friends: [
      { firstName: 'Iron', lastName: 'Man' },
      { firstName: 'Captain', lastName: 'America' },
    ],
  },
];
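One way to reason about nested contracts is to derive the shape of a scraped item from the contract itself. The sketch below assumes that any attribute carrying its own itemSelector and attributes is nested and becomes an array in the output. `outputShape` is a hypothetical helper and not part of contract-scraper.

```javascript
// Hypothetical helper: derives the shape of one scraped item from a contract.
// A definition with its own itemSelector and attributes is treated as nested
// and becomes a one-element array describing its items; any other definition
// is represented by its type name.
function outputShape(contract) {
  const shape = {};
  for (const [name, definition] of Object.entries(contract.attributes)) {
    const isNested = Boolean(definition.itemSelector && definition.attributes);
    shape[name] = isNested ? [outputShape(definition)] : definition.type;
  }
  return shape;
}

const friendsContract = {
  itemSelector: '.friends li',
  attributes: {
    name: { type: 'text', selector: 'span' },
    friends: {
      itemSelector: 'ul li',
      attributes: {
        firstName: { type: 'text', selector: 'strong' },
        lastName: { type: 'text', selector: 'em' },
      },
    },
  },
};

const shape = outputShape(friendsContract);
// shape is { name: 'text', friends: [{ firstName: 'text', lastName: 'text' }] }
```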
Custom attribute types
In addition to the in-built attribute types, you can provide your own when you create a new instance of the scraper. A custom attribute type needs to be a class or a function that has a value property. Its constructor receives the innerText string of the matching element, which you can then format however you like.
For example, if you wanted to extract a list of tags and format them as an array:
<ul>
  <li>
    <div class="name">Australia</div>
    <div class="tags">spiders,vegemite,scorching,heat</div>
  </li>
</ul>
import Scraper from 'contract-scraper';

const contract = {
  itemSelector: 'li',
  attributes: {
    countryName: {
      type: 'text',
      selector: '.name',
    },
    tags: {
      type: 'list',
      selector: '.tags',
    },
  },
};

function ListFromString(commaSeparatedString) {
  // Expose the parsed list on the value property that custom types need.
  this.value = commaSeparatedString.split(',');
}

const scraper = new Scraper('http://countries.com', contract, {
  list: ListFromString,
});

scraper.scrapePage().then(items => {
  console.log(items);
  // [
  //   {
  //     countryName: 'Australia',
  //     tags: [ 'spiders', 'vegemite', 'scorching', 'heat' ]
  //   }
  // ]
});
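Since a custom type can also be a class, the same idea can be sketched in class form. `TagList` is a hypothetical name; the only assumption taken from the description above is that the constructor receives the element text and that the result is exposed on a value property. The trimming and empty-entry filtering are extra choices made for this example.

```javascript
// A class-based custom type exposing a `value` property.
// TagList is a hypothetical name used for illustration.
class TagList {
  constructor(text) {
    // Split on commas, trim whitespace, and drop empty entries.
    this.value = String(text)
      .split(',')
      .map(tag => tag.trim())
      .filter(tag => tag.length > 0);
  }
}

// Standalone check of the type, without running a scrape.
const tags = new TagList('spiders, vegemite,scorching,heat').value;
```

Assuming it is registered the same way as ListFromString, it would be passed in the types object as `{ list: TagList }`.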
Parsing JSON inside script tags
Sometimes you may want to extract values from inside a script tag on the page. For the moment, contract-scraper only supports parsing JSON. For example:
<html>
  <head>
    <title>Page with a script tag</title>
  </head>
  <body>
    <script type="application/ld+json" id="info">
      {
        "characters": [
          {
            "name": "Jon Snow",
            "friends": [
              { "firstName": "Sansa", "lastName": "Stark" },
              { "firstName": "Bran", "lastName": "Stark" },
              { "firstName": "Arya", "lastName": "Stark" }
            ],
            "photo": "http://images.com/jonsnow",
            "price": {
              "amount": "12345 dollars"
            }
          },
          {
            "name": "Ned Stark",
            "friends": [
              { "firstName": "Sansa", "lastName": "Stark" },
              { "firstName": "Bobby", "lastName": "B" },
              { "firstName": "Little", "lastName": "finger" }
            ],
            "photo": "http://images.com/nedstark",
            "price": {
              "amount": "6789 euros"
            }
          }
        ]
      }
    </script>
  </body>
</html>
const contract = {
  scriptTagSelector: '#info',
  itemSelector: 'characters',
  attributes: {
    name: { type: 'text', selector: 'name' },
    friends: {
      itemSelector: 'friends',
      attributes: {
        firstName: { type: 'text', selector: 'firstName' },
        lastName: { type: 'text', selector: 'lastName' },
      },
    },
    photo: { type: 'link', selector: 'photo' },
    price: { type: 'number', selector: 'price.amount' },
  },
};

const scraper = new Scraper('http://characters.com', contract);

scraper.scrapePage().then(items => {
  console.log(items);
  // [
  //   {
  //     "name": "Jon Snow",
  //     "friends": [
  //       {
  //         "firstName": "Sansa",
  //         "lastName": "Stark"
  //       },
  //       {
  //         "firstName": "Bran",
  //         "lastName": "Stark"
  //       },
  //       {
  //         "firstName": "Arya",
  //         "lastName": "Stark"
  //       }
  //     ],
  //     "photo": "http://images.com/jonsnow",
  //     "price": 12345
  //   },
  //   {
  //     "name": "Ned Stark",
  //     "friends": [
  //       {
  //         "firstName": "Sansa",
  //         "lastName": "Stark"
  //       },
  //       {
  //         "firstName": "Bobby",
  //         "lastName": "B"
  //       },
  //       {
  //         "firstName": "Little",
  //         "lastName": "finger"
  //       }
  //     ],
  //     "photo": "http://images.com/nedstark",
  //     "price": 6789
  //   }
  // ]
});
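In this mode the selectors are key paths into the parsed JSON rather than CSS selectors, with dots for nesting (as in price.amount). A minimal sketch of that lookup, applied recursively for nested attributes, could look like the following. `getPath` and `applyContract` are hypothetical helpers and not contract-scraper internals; only the number type is converted here, and the data is a shortened version of the example above.

```javascript
// Hypothetical helpers, not contract-scraper internals.

// Follow a dotted path such as 'price.amount' through nested objects.
function getPath(source, path) {
  return path
    .split('.')
    .reduce((value, key) => (value == null ? undefined : value[key]), source);
}

// Apply a contract to already-parsed JSON. Only 'number' is converted;
// every other type is returned unchanged in this sketch.
function applyContract(data, contract) {
  const items = getPath(data, contract.itemSelector) || [];
  return items.map(item => {
    const result = {};
    for (const [name, definition] of Object.entries(contract.attributes)) {
      if (definition.itemSelector && definition.attributes) {
        // Nested attribute: recurse into the child list.
        result[name] = applyContract(item, definition);
        continue;
      }
      const raw = getPath(item, definition.selector);
      result[name] = definition.type === 'number' ? parseFloat(raw) : raw;
    }
    return result;
  });
}

const data = JSON.parse(`{
  "characters": [
    {
      "name": "Jon Snow",
      "friends": [{ "firstName": "Sansa", "lastName": "Stark" }],
      "price": { "amount": "12345 dollars" }
    }
  ]
}`);

const items = applyContract(data, {
  itemSelector: 'characters',
  attributes: {
    name: { type: 'text', selector: 'name' },
    friends: {
      itemSelector: 'friends',
      attributes: {
        firstName: { type: 'text', selector: 'firstName' },
        lastName: { type: 'text', selector: 'lastName' },
      },
    },
    price: { type: 'number', selector: 'price.amount' },
  },
});
```

Here `items[0].price` is the number `12345`, because `parseFloat` stops at the first non-numeric character of "12345 dollars".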