html-miner
v4.0.0
HTML Miner
A powerful miner that will scrape html pages for you.
Install
# using npm
npm i --save html-miner
# using yarn
yarn add html-miner
Example
Common use cases are collected in a dedicated EXAMPLE.md. Feel free to start from the Usage section or jump directly to the Example page.
If you want to experiment, an online playground is also available.
:green_book: Enjoy your reading
Usage
Arguments
html-miner accepts two arguments: html and selector.
const htmlMiner = require('html-miner');
// htmlMiner(html, selector);
HTML
html is a string containing HTML code.
let html = '<div class="title">Hello <span>Marco</span>!</div>';
SELECTOR
selector can be one of the following:
STRING
htmlMiner(html, '.title');
//=> Hello Marco!
If the selector extracts more elements, the result is an array:
let htmlWithDivs = '<div>Element 1</div><div>Element 2</div>';
htmlMiner(htmlWithDivs, 'div');
//=> ['Element 1', 'Element 2']
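Because a string selector yields a single string for one match and an array for several, calling code may want a uniform shape. The helper below is a minimal sketch; toArray is a hypothetical name and not part of html-miner, and treating undefined or null as "no match" is an assumption.

```javascript
// Hypothetical helper, not part of html-miner: always returns an array.
function toArray(value) {
  // Assumption: a missing result is represented as undefined or null.
  if (value === undefined || value === null) return [];
  return Array.isArray(value) ? value : [value];
}

toArray('Hello Marco!');
//=> ['Hello Marco!']
toArray(['Element 1', 'Element 2']);
//=> ['Element 1', 'Element 2']
```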
FUNCTION
See the Function in detail section.
htmlMiner(html, () => 'Hello everyone!');
//=> Hello everyone!
htmlMiner(html, function () {
  return 'Hello everyone!';
});
//=> Hello everyone!
ARRAY
htmlMiner(html, ['.title', 'span']);
//=> ['Hello Marco!', 'Marco']
OBJECT
htmlMiner(html, {
  title: '.title',
  who: 'span'
});
//=> {
//   title: 'Hello Marco!',
//   who: 'Marco'
// }
You can combine arrays and objects with each other, or with strings and functions.
htmlMiner(html, {
  title: '.title',
  who: '.title span',
  upper: (arg) => { return arg.scopeData.who.toUpperCase(); }
});
//=> {
//   title: 'Hello Marco!',
//   who: 'Marco',
//   upper: 'MARCO'
// }
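Since function selectors are plain functions, they can be produced by a small factory and reused across keys. The sketch below relies only on the scopeData property used in the example above; upperOf is a hypothetical helper name, and the function is called by hand with a stand-in argument rather than through htmlMiner.

```javascript
// Hypothetical helper: builds a function selector that upper-cases a sibling key.
const upperOf = (key) => (arg) => String(arg.scopeData[key]).toUpperCase();

// A selector object of the kind passed as the second argument to htmlMiner.
const selector = {
  title: '.title',
  who: '.title span',
  upperTitle: upperOf('title'),
  upperWho: upperOf('who'),
};

// Calling the function directly with a stand-in argument shows what it computes.
selector.upperWho({ scopeData: { who: 'Marco' } });
//=> 'MARCO'
```

Keys that a function reads must appear before it in the object, because it only sees data that has already been fetched.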
Function in detail
A function accepts a single argument, an object containing:
$: a jQuery-like function pointing to the document (the html argument). You can use it to query and fetch elements from the HTML.
htmlMiner(html, arg => arg.$('.title').text());
//=> Hello Marco!
$scope: useful when combined with _each_ or _container_ (see the Special keys section).
htmlMiner(html, {
  title: '.title',
  spanList: {
    _each_: 'span',
    value: (arg) => {
      // arg.$scope.find('.title') matches nothing: $scope is the span itself.
      return arg.$scope.text();
    }
  }
});
//=> {
//   title: 'Hello Marco!',
//   spanList: [{
//     value: 'Marco'
//   }]
// }
globalData: an object that contains all previously fetched data.
htmlMiner(html, {
  title: '.title',
  spanList: {
    _each_: '.title span',
    pageTitle: function(arg) {
      // "arg.globalData.who" is undefined because "who" is defined later.
      return arg.globalData.title;
    }
  },
  who: '.title span'
});
//=> {
//   title: 'Hello Marco!',
//   spanList: [{
//     pageTitle: 'Hello Marco!'
//   }],
//   who: 'Marco'
// }
scopeData: similar to globalData, but contains only the data of the current scope. Useful when combined with _each_ (see the Special keys section).
htmlMiner(html, {
  title: '.title',
  upper: (arg) => { return arg.scopeData.title.toUpperCase(); },
  sublist: {
    who: '.title span',
    upper: (arg) => {
      // "arg.scopeData.title" is undefined because "title" is out of scope.
      return arg.scopeData.who.toUpperCase();
    },
  }
});
//=> {
//   title: 'Hello Marco!',
//   upper: 'HELLO MARCO!',
//   sublist: {
//     who: 'Marco',
//     upper: 'MARCO'
//   }
// }
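The difference between globalData and scopeData can be pictured with a toy model. This is a sketch under stated assumptions, not html-miner's real implementation: resolve and fakeQuery are hypothetical names, string selectors are looked up in a plain object instead of being run against a document, and globalData is assumed to be the root result as it fills up key by key.

```javascript
// Toy model only. Stand-in for DOM queries: selector string -> extracted text.
const fakeQuery = { '.title': 'Hello Marco!', '.title span': 'Marco' };

function resolve(selector, globalData, isRoot = true) {
  if (typeof selector === 'string') return fakeQuery[selector];
  // Each nested object gets a fresh scopeData; at the root it is globalData itself.
  const scopeData = isRoot ? globalData : {};
  for (const [key, value] of Object.entries(selector)) {
    scopeData[key] = typeof value === 'function'
      ? value({ globalData, scopeData })      // functions see data fetched so far
      : resolve(value, globalData, false);    // strings and nested objects
  }
  return scopeData;
}

const result = resolve({
  title: '.title',
  early: (arg) => arg.globalData.who,           // "who" is defined later
  sublist: {
    who: '.title span',
    outer: (arg) => arg.scopeData.title,        // "title" is out of scope
    viaGlobal: (arg) => arg.globalData.title,   // reachable through globalData
  },
  who: '.title span',
}, {});
//=> early and sublist.outer are undefined, sublist.viaGlobal is 'Hello Marco!'
```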
Special keys
When selector is an object, you can use special keys:
_each_: creates a list of items. HTML Miner iterates over the elements matched by the value and parses the sibling keys for each one.
{
  articles: {
    _each_: '.articles .article',
    title: 'h2',
    content: 'p',
  }
}
_eachId_: useful when combined with _each_. Instead of creating an Array, it creates an Object whose keys are the results of the _eachId_ function.
{
  articles: {
    _each_: '.articles .article',
    _eachId_: function(arg) {
      return arg.$scope.data('id');
    },
    title: 'h2',
    content: 'p',
  }
}
_container_: uses the parsed value as a container. HTML Miner parses the sibling keys, searching for them inside the container.
{
  footer: {
    _container_: 'footer',
    copyright: (arg) => { return arg.$scope.text().trim(); },
    company: 'span' // find only 'span' inside 'footer'.
  }
}
For more details see the following example.
Let's try this out
Consider the following HTML snippet, from which we will fetch some information.
<h1>Hello, <span>world</span>!</h1>
<div class="articles">
  <div class="article" data-id="a001">
    <h2>Heading 1</h2>
    <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
  </div>
  <div class="article" data-id="a002">
    <h2>Heading 2</h2>
    <p>Donec maximus ipsum quis est tempor, sit amet laoreet libero bibendum.</p>
  </div>
  <div class="article" data-id="a003">
    <h2>Heading 3</h2>
    <p>Suspendisse viverra convallis risus, vitae molestie est tincidunt eget.</p>
  </div>
</div>
<footer>
  <p>© <span>Company</span> 2017</p>
</footer>
const htmlMiner = require('html-miner');
// html holds the snippet above as a string.
let json = htmlMiner(html, {
  title: 'h1',
  who: 'h1 span',
  h2: 'h2',
  articlesArray: {
    _each_: '.articles .article',
    title: 'h2',
    content: 'p',
  },
  articlesObject: {
    _each_: '.articles .article',
    _eachId_: function(arg) {
      return arg.$scope.data('id');
    },
    title: 'h2',
    content: 'p',
  },
  footer: {
    _container_: 'footer',
    copyright: (arg) => { return arg.$scope.text().trim(); },
    company: 'span',
    year: (arg) => { return arg.scopeData.copyright.match(/[0-9]+/)[0]; },
  },
  greet: () => { return 'Hi!'; }
});
console.log( json );
//=> {
//   title: 'Hello, world!',
//   who: 'world',
//   h2: ['Heading 1', 'Heading 2', 'Heading 3'],
//   articlesArray: [
//     {
//       title: 'Heading 1',
//       content: 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.',
//     },
//     {
//       title: 'Heading 2',
//       content: 'Donec maximus ipsum quis est tempor, sit amet laoreet libero bibendum.',
//     },
//     {
//       title: 'Heading 3',
//       content: 'Suspendisse viverra convallis risus, vitae molestie est tincidunt eget.',
//     }
//   ],
//   articlesObject: {
//     'a001': {
//       title: 'Heading 1',
//       content: 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.',
//     },
//     'a002': {
//       title: 'Heading 2',
//       content: 'Donec maximus ipsum quis est tempor, sit amet laoreet libero bibendum.',
//     },
//     'a003': {
//       title: 'Heading 3',
//       content: 'Suspendisse viverra convallis risus, vitae molestie est tincidunt eget.',
//     }
//   },
//   footer: {
//     copyright: '© Company 2017',
//     company: 'Company',
//     year: '2017'
//   },
//   greet: 'Hi!'
// }
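The two article shapes in the output above suit different access patterns. The sketch below works on a stand-in literal copied from that output (trimmed to two articles) rather than on a live htmlMiner call; the variable names other than articlesArray and articlesObject are illustrative.

```javascript
// Stand-in for the value returned by htmlMiner, copied from the output above.
const json = {
  articlesArray: [
    { title: 'Heading 1', content: 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.' },
    { title: 'Heading 2', content: 'Donec maximus ipsum quis est tempor, sit amet laoreet libero bibendum.' },
  ],
  articlesObject: {
    a001: { title: 'Heading 1', content: 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.' },
    a002: { title: 'Heading 2', content: 'Donec maximus ipsum quis est tempor, sit amet laoreet libero bibendum.' },
  },
};

// Keyed shape (_eachId_): direct lookup by data-id.
const second = json.articlesObject.a002;

// Array shape (_each_ alone): positional access and iteration.
const titles = json.articlesArray.map((article) => article.title);

// Turning the keyed shape back into a list that keeps the id on each item.
const withIds = Object.entries(json.articlesObject)
  .map(([id, article]) => ({ id, ...article }));
```

The keyed form is convenient when items are later looked up by identifier; the array form is simpler when the items are only listed.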
You can find more examples in the /examples folder.
# you can test examples with nodejs
node examples/demo.js
node examples/site.js
Development
npm install
npm test
# start the playground locally
npm start