x-ray-scraper
v3.0.6
Published
Scraper next gen based on x-ray (2.3.2)
Downloads
2,691
Maintainers
Readme
Why a fork
I wanted to use x-ray in a react-native projet but when I looked to the code I thought that a first refactoring was necessary. I took some liberties along the way and made some changes to the original api, it's why I created a new package called "x-ray-scraper".
Main differences with the original x-ray:
- instatiation of x-ray is different,
- all the callbacks were replaced with promises (except for the crawler),
- javascript up to date (use of es6+ features),
- "x-ray-crawler" and "x-ray-parse" npm module are now included as a part of the project,
- mocha was replaced with jest.
It's not yet compatible with react-native but could be in the near future.
const x = require('x-ray-scraper');
x('https://blog.ycombinator.com/', '.post', [{
title: 'h1 a',
link: '.article-title@href'
}])
.paginate('.nav-previous a@href')
.limit(3)
.write('results.json')
Installation
npm install x-ray-scraper
Features
Flexible schema: Supports strings, arrays, arrays of objects, and nested object structures. The schema is not tied to the structure of the page you're scraping, allowing you to pull the data in the structure of your choosing.
Composable: The API is entirely composable, giving you great flexibility in how you scrape each page.
Pagination support: Paginate through websites, scraping each page. X-ray also supports a request
delay
and a paginationlimit
. Scraped pages can be streamed to a file, so if there's an error on one page, you won't lose what you've already scraped.Crawler support: Start on one page and move to the next easily. The flow is predictable, following a breadth-first crawl through each of the pages.
Responsible: X-ray has support for concurrency, throttles, delays, timeouts and limits to help you scrape any page responsibly.
Pluggable drivers: Swap in different scrapers depending on your needs.
Selector API
xray(url, selector)
Scrape the url
for the following selector
, returning an object via a promise.
The selector
takes an enhanced jQuery-like string that is also able to select on attributes. The syntax for selecting on attributes is selector@attribute
. If you do not supply an attribute, the default is selecting the innerText
.
Here are a few examples:
- Scrape a single tag
x('http://google.com', 'title')
.then(title) {
console.log(title); // Google
})
OR
const title = await x('http://google.com', 'title');
console.log(title);
- Scrape a single class
x('http://reddit.com', '.content')
.then(console.log)
.catch(console.error)
- Scrape an attribute
x('http://techcrunch.com', 'img.logo@src')
.then(console.log)
.catch(console.error)
- Scrape
innerHTML
x('http://news.ycombinator.com', 'body@html')
.then(console.log)
.catch(console.error)
xray(url, scope, selector)
You can also supply a scope
to each selector
. In jQuery, this would look something like this: $(scope).find(selector)
.
xray(html, scope, selector)
Instead of a url, you can also supply raw HTML and all the same semantics apply.
const html = "<body><h2>Pear</h2></body>";
x(html, 'body', 'h2')
.then(header => console.log(header))
.catch(console.error)
API
xray.driver(driver)
Specify a driver
to make requests through. Available drivers include:
- request - A simple driver built around request. Use this to set headers, cookies or http methods.
- phantom - A high-level browser automation library. Use this to render pages or when elements need to be interacted with, or when elements are created dynamically using javascript (e.g.: Ajax-calls).
xray.stream()
Returns Readable Stream of the data. This makes it easy to build APIs around x-ray. Here's an example with Express:
const app = require('express')();
const x = require('x-ray-scraper')();
app.get('/', (req, res) => {
const stream = x('http://google.com', 'title').stream();
stream.pipe(res);
})
xray.write([path])
Stream the results to a path
.
If no path is provided, then the behavior is the same as .stream().
xray.then(cb)
Constructs a Promise
object and invoke its then
function with a callback cb
. Be sure to invoke then()
at the last step of xray method chaining.
x('https://dribbble.com', 'li.group', [{
title: '.dribbble-img strong',
image: '.dribbble-img [data-src]@data-src',
}])
.paginate('.next_page@href')
.limit(3)
.then(res => {
console.log(res[0]) // prints first result
})
.catch(err => {
console.log(err) // handle error in promise
})
xray.paginate(selector)
Select a url
from a selector
and visit that page.
Also accept a function as argument.
The selector
function receives two arguments:
pageNumber
: The next page number (first call with value 2)$
: Cheerio object if you want to select stuffs to compute the next url.
x('https://blog.ycombinator.com/', '.post', [
{
title: 'h1 a',
link: '.article-title@href'
}
])
.paginate((pageNumber, $) => `https://blog.ycombinator.com/page/${pageNumber}/`)
.limit(3);
.then(res => {
console.log(res[0]) // prints first result
})
.catch(err => {
console.log(err) // handle error in promise
})
xray.limit(n)
Limit the amount of pagination to n
requests.
xray.abort(validator)
Abort pagination if validator
function returns true
.
The validator
function receives two arguments:
result
: The scrape result object for the current page.nextUrl
: The URL of the next page to scrape.
xray.delay(from, [to])
Delay the next request between from
and to
milliseconds.
If only from
is specified, delay exactly from
milliseconds.
xray.concurrency(n)
Set the request concurrency to n
. Defaults to Infinity
.
xray.throttle(n, ms)
Throttle the requests to n
requests per ms
milliseconds.
xray.timeout (ms)
Specify a timeout of ms
milliseconds for each request.
Collections
X-ray also has support for selecting collections of tags. While x('ul', 'li')
will only select the first list item in an unordered list, x('ul', ['li'])
will select all of them.
Additionally, X-ray supports "collections of collections" allowing you to smartly select all list items in all lists with a command like this: x(['ul'], ['li'])
.
Composition
X-ray becomes more powerful when you start composing instances together. Here are a few possibilities:
Crawling to another site
const x = require('x-ray-scraper');
x('http://google.com', {
main: 'title',
image: x('#gbar a@href', 'title'), // follow link to google images
})
.then(obj => {
/*
{
main: 'Google',
image: 'Google Images'
}
*/
});
Scoping a selection
const x = require('x-ray-scraper');
x('http://mat.io', {
title: 'title',
items: x('.item', [{
title: '.item-content h2',
description: '.item-content section'
}])
)
.then(obj => {
/*
{
title: 'mat.io',
items: [
{
title: 'The 100 Best Children\'s Books of All Time',
description: 'Relive your childhood with TIME\'s list...'
}
]
}
*/
})
Filters
Filters can be specified with setFilters method. To apply filters to a value, append them to the selector using |
.
const x = require('x-ray-scraper');
x.setFilters({
trim: function (value) {
return typeof value === 'string' ? value.trim() : value
},
reverse: function (value) {
return typeof value === 'string' ? value.split('').reverse().join('') : value
},
slice: function (value, start , end) {
return typeof value === 'string' ? value.slice(start, end) : value
}
});
x('http://mat.io', {
title: 'title | trim | reverse | slice:2,3'
})
.then(obj => {
/*
{
title: 'oi'
}
*/
})
Create multiple instances
Sometimes, you may need to instantiate multiple instances of xray.
const Xray = require('x-ray-scraper/Xray');
const xOne = Xray();
// Optionnal parameter: driver
const xTwo = Xray(/* driver */);
xOne('http://reddit.com', '.content')
.then(console.log)
.catch(console.error);
xTwo('http://reddit.com', '.content')
.then(console.log)
.catch(console.error);
Examples
- selector: simple string selector
- collections: selects an object
- arrays: selects an array
- collections of collections: selects an array of objects
- array of arrays: selects an array of arrays
Credits
- Big thanks to Matthew Mueller and all the contibutors of the original module.
License
MIT