html-data-scraper
v1.1.1
An efficient wrapper around puppeteer for data scraping web pages.
Install
yarn add html-data-scraper
Or
npm install html-data-scraper
API
htmlDataScraper(urls, configurations, customBrowser)
- urls: string[]
  An array of urls.
- configurations?: CustomConfigurations
  A CustomConfigurations object to configure everything.
- customBrowser?: Browser
  An instance of Browser created outside the library. This instance is also given back in browserInstance.
- returns: Promise<{results:PageResult[], browserInstance: Browser}>
  A Promise that resolves to an array of PageResult objects and the Browser instance used during the process.
By default, this main function distributes the scraping process across all available CPU cores minus one (on a computer with 4 cores, it distributes across 3). Distributing means opening one page for each available core.
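How urls are handed out to the open pages is not described here. As a rough illustration only, the sketch below assumes a shared queue from which each page takes the next unprocessed url. `runOnPages`, `pageCount` and `processUrl` are hypothetical names and are not part of html-data-scraper.

```typescript
// Illustrative only: a shared-queue distribution, NOT the library's actual code.
// `runOnPages`, `pageCount` and `processUrl` are hypothetical names.
async function runOnPages<T>(
    urls: string[],
    pageCount: number,
    processUrl: (url: string, pageIndex: number) => Promise<T>,
): Promise<T[]> {
    const results: T[] = new Array(urls.length);
    let nextIndex = 0;

    // Each "page" repeatedly takes the next unprocessed url until none remain.
    const worker = async (pageIndex: number): Promise<void> => {
        while (nextIndex < urls.length) {
            const current = nextIndex++;
            results[current] = await processUrl(urls[current], pageIndex);
        }
    };

    // Never start more pages than there are urls to process.
    const workers: Promise<void>[] = [];
    for (let i = 0; i < Math.min(pageCount, urls.length); i++) {
        workers.push(worker(i));
    }
    await Promise.all(workers);

    // Results keep the order of the input urls.
    return results;
}
```

With 7 urls and 3 pages, at most 3 urls are in flight at any time, and a page that finishes early simply picks up the next url.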
Usage
import htmlDataScraper, {PageResult} from 'html-data-scraper';
import {Browser} from 'puppeteer';

const urls: string[] = [];
const urlNumber = 7;
const maxSimultaneousBrowser = 3;

// Scrape the same page several times so the work is spread over several pages.
for (let i = 0; i < urlNumber; i++) {
    urls.push('https://fr.wikipedia.org/wiki/World_Wide_Web');
}

htmlDataScraper(urls, {
    maxSimultaneousBrowser,
    onEvaluateForEachUrl: {
        title: (): string => {
            const titleElement: HTMLElement | null = document.getElementById('firstHeading');
            // Guard against a missing heading before looking for its inner span.
            const innerElement: HTMLCollectionOf<HTMLElement> | null = titleElement ? titleElement.getElementsByTagName('span') : null;
            return innerElement && innerElement.length ? innerElement[0].innerText : '';
        },
    },
    onProgress: (resultNumber: number, totalNumber: number, internalPageIndex: number) => {
        console.log('Scraping page n°', internalPageIndex, '->', resultNumber + '/' + totalNumber);
    },
})
    .then(({results}: {results: PageResult[], browserInstance: Browser}) => {
        console.log(results);
        // [
        //     {
        //         pageData: '<!DOCTYPE html>.....',
        //         evaluates: {
        //             title: 'World Wide Web'
        //         }
        //     },
        //     ...
        // ]
    });
CustomConfigurations
This object is used to set up the scraping process and puppeteer itself. The following object shows the default values:
const configurations = {
    maxSimultaneousBrowser : 1,
    additionalWaitSeconds : 1,
    puppeteerOptions : {
        browser : {
            args : [
                '--no-sandbox',
                '--disable-setuid-sandbox',
                '--disable-dev-shm-usage',
            ],
        },
        pageGoTo : { waitUntil: 'networkidle2' },
    },
}
If maxSimultaneousBrowser is not set during initialization, the library uses all available cores minus one, falling back to 1 on machines with two cores or fewer:
if (!customConfigurations.hasOwnProperty('maxSimultaneousBrowser')){
    const cpuCoreCount = os.cpus().length;
    configuration.maxSimultaneousBrowser = cpuCoreCount > 2 ? cpuCoreCount - 1 : 1;
}
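To make the fallback rule easier to check for a given machine, it can be isolated in a small pure function. This is only a restatement of the snippet above. `defaultMaxSimultaneousBrowser` is a hypothetical name, and in Node.js the argument would typically be `os.cpus().length`.

```typescript
// Hypothetical helper restating the fallback rule shown above.
// In Node.js, pass os.cpus().length as cpuCoreCount.
function defaultMaxSimultaneousBrowser(cpuCoreCount: number): number {
    // Keep one core free when more than two are available,
    // otherwise fall back to a single page.
    return cpuCoreCount > 2 ? cpuCoreCount - 1 : 1;
}
```

Under this rule a 4-core machine opens 3 pages, while 1-core and 2-core machines both open a single page.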
Additionally you can use the following keys:
onPageRequest
onPageRequest: (request) => void
- request: Request
  Represents a page HTTPRequest.
This function is triggered whenever the page sends a request, such as for a network resource.
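As one possible use of this hook, the sketch below counts requests by resource type. `createRequestCounter` and `RequestLike` are hypothetical names introduced for the example. `RequestLike` only declares `resourceType()`, a method that puppeteer's HTTPRequest provides, so the handler can be exercised without launching a browser.

```typescript
// Minimal structural type for the one puppeteer HTTPRequest method used here.
interface RequestLike {
    resourceType(): string;
}

// Hypothetical helper: builds a handler suitable for `onPageRequest`
// together with the counters it fills.
function createRequestCounter(): {
    counts: Record<string, number>;
    onPageRequest: (request: RequestLike) => void;
} {
    const counts: Record<string, number> = {};
    const onPageRequest = (request: RequestLike): void => {
        // e.g. 'document', 'script', 'image', ...
        const type = request.resourceType();
        counts[type] = (counts[type] || 0) + 1;
    };
    return {counts, onPageRequest};
}
```

The returned `onPageRequest` can then be passed in the configurations object, and `counts` read once the scraping promise resolves.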
onPageLoadedForEachUrl
onPageLoadedForEachUrl: (puppeteerPage, currentUrl) => {}
- puppeteerPage: Page
  A reference to the current puppeteer Page.
- currentUrl: string
  The current url of the page.
- return: any
  You can return whatever you need.
The returned value is set in PageResult.pageData for each url.
If this function is not set, PageResult.pageData is set by default with page.content().
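For instance, the callback could return a small summary instead of the full html. The sketch below assumes that the library awaits the value returned by the callback, which is not stated above. `PageLike` and `PageSummary` are hypothetical types that only declare the puppeteer Page methods used here (`title()` and `content()`).

```typescript
// Minimal structural type for the puppeteer Page methods used in this sketch.
interface PageLike {
    title(): Promise<string>;
    content(): Promise<string>;
}

// Hypothetical summary shape that would end up in PageResult.pageData.
interface PageSummary {
    url: string;
    title: string;
    htmlLength: number;
}

// Assumption: the library awaits the promise returned by this callback.
const onPageLoadedForEachUrl = async (
    puppeteerPage: PageLike,
    currentUrl: string,
): Promise<PageSummary> => {
    const title = await puppeteerPage.title();
    const html = await puppeteerPage.content();
    // Keep only what is needed instead of the whole document.
    return {url: currentUrl, title, htmlLength: html.length};
};
```

If the returned promise turns out not to be awaited, a synchronous return value would be needed instead.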
onEvaluateForEachUrl
onEvaluateForEachUrl: {}
This key contains an object that registers functions. These functions are used by page.evaluate, and each function's return value is set under the corresponding name in PageResult.evaluates:
import htmlDataScraper, {PageResult} from 'html-data-scraper';
import {Browser} from 'puppeteer';

htmlDataScraper([
    'https://www.bbc.com',
], {
    onEvaluateForEachUrl: {
        title: (): string => {
            const titleElement: HTMLElement | null = document.getElementById('page-title');
            return titleElement ? titleElement.innerText : '';
        },
    },
})
    // The promise resolves to an object, so destructure `results` before indexing it.
    .then(({results}: {results: PageResult[], browserInstance: Browser}) => {
        console.log(results[0]);
        // {
        //     pageData: "....",
        //     evaluates: {
        //         title: "..."
        //     }
        // }
    });
onProgress
onProgress: (resultNumber, totalNumber, internalPageIndex) => {}
- resultNumber: number
  The number of processed urls.
- totalNumber: number
  The total number of urls to process.
- internalPageIndex: number
  An index identifying the page that processes the urls.
- return: void
  No return value is needed.
This function runs each time the processing of a page finishes.
import htmlDataScraper from 'html-data-scraper';

const progress: Record<string,string[]> = {};

htmlDataScraper([
    // ...
], {
    onProgress: (resultNumber: number, totalNumber: number, internalPageIndex: number) => {
        const status = resultNumber + '/' + totalNumber;
        if (progress[internalPageIndex]){
            progress[internalPageIndex].push(status);
        } else {
            progress[internalPageIndex] = [status];
        }
    },
});
PageResult
interface PageResult{
    pageData: any;
    evaluates: null | {
        [k: string]: any;
    };
}
pageData
This key contains the html content of the webpage. If you set onPageLoadedForEachUrl, it contains the value returned by that function instead.
evaluates
This key contains the results of the onEvaluateForEachUrl functions.
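Because evaluates may be null and its values are typed as any, reading a value safely takes a little care. The sketch below shows one way to collect a single named value from every result. `pluckEvaluate` is a hypothetical helper, and the interface is a local copy of the PageResult shape documented above so that the block stands alone.

```typescript
// Local copy of the PageResult shape documented above.
interface PageResult {
    pageData: any;
    evaluates: null | {
        [k: string]: any;
    };
}

// Hypothetical helper: collects one named evaluate value per result,
// using a fallback when `evaluates` is null or the value is not a string.
function pluckEvaluate(results: PageResult[], key: string, fallback: string = ''): string[] {
    return results.map((result: PageResult): string => {
        const value = result.evaluates ? result.evaluates[key] : undefined;
        return typeof value === 'string' ? value : fallback;
    });
}
```

With the Usage example, `pluckEvaluate(results, 'title')` would give one title string per scraped url.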
Development
Setup
- Clone this repository
- Run yarn install
Run tests
yarn test
Author
👤 Ravidhu Dissanayake
Show your support
Give a ⭐️ if this project helped you!