
html-data-scraper

v1.1.1


An efficient wrapper around puppeteer for data scraping web pages.

Install

yarn add html-data-scraper

Or

npm install html-data-scraper

API

htmlDataScraper(urls, configurations, customBrowser)

  • urls: string[] An array of urls
  • configurations?: CustomConfigurations A CustomConfigurations object used to configure the whole process.
  • customBrowser?: Browser A Browser instance created outside the library; this instance is also returned in browserInstance.
  • returns Promise<{results: PageResult[], browserInstance: Browser}> A Promise that resolves to an array of PageResult objects and the Browser instance used during the process.

This main function distributes the scraping process across all but one of the available CPU cores (if the computer has 4 cores, it distributes across 3). Distributing means opening one page per available core.

Usage

import htmlDataScraper, {PageResult} from 'html-data-scraper';
import {Browser} from 'puppeteer';

const urls: string[] = [];
const urlNumber = 7;
const maxSimultaneousBrowser = 3;

for (let i = 0; i < urlNumber; i++) {
    urls.push('https://fr.wikipedia.org/wiki/World_Wide_Web');
}

htmlDataScraper(urls, {
    maxSimultaneousBrowser,
    onEvaluateForEachUrl: {
        title: (): string => {
            const titleElement: HTMLElement | null = document.getElementById('firstHeading');
            const innerElements: HTMLCollectionOf<HTMLElement> | null = titleElement
                ? titleElement.getElementsByTagName('span')
                : null;
            return innerElements && innerElements.length ? innerElements[0].innerText : '';
        },
    },
    onProgress: (resultNumber: number, totalNumber: number, internalPageIndex: number) => {
        console.log('Scraping page n°', internalPageIndex, '->', resultNumber + '/' + totalNumber);
    },
})
    .then(({results}: {results: PageResult[], browserInstance: Browser}) => {

        console.log(results);
        // [
        //    {
        //        pageData: '<!DOCTYPE html>.....',
        //        evaluates: {
        //            title: 'World Wide Web'
        //        }
        //    },
        //    ...
        // ]

    });

CustomConfigurations

This object is used to set up the scraping process and puppeteer itself. The following object shows the default values:

const configurations = {
    maxSimultaneousBrowser  : 1,
    additionalWaitSeconds   : 1,
    puppeteerOptions        : {
        browser : {
            args : [
                '--no-sandbox',
                '--disable-setuid-sandbox',
                '--disable-dev-shm-usage',
            ],
        },
        pageGoTo : { waitUntil: 'networkidle2' },
    },
}
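
You can override only the keys you need. The following sketch assumes that keys left out of the configurations object keep the defaults shown above, and uses a hypothetical url:

```typescript
import htmlDataScraper from 'html-data-scraper';

// Wait longer after load and use a stricter idle condition.
// Keys not listed here are assumed to keep the defaults shown above.
htmlDataScraper(['https://example.com'], {
    additionalWaitSeconds: 3,
    puppeteerOptions: {
        pageGoTo: { waitUntil: 'networkidle0' },
    },
});
```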

If maxSimultaneousBrowser is not set in the configurations, the library uses all available cores minus one:

if (!customConfigurations.hasOwnProperty('maxSimultaneousBrowser')) {
    const cpuCoreCount = os.cpus().length;
    customConfigurations.maxSimultaneousBrowser = cpuCoreCount > 2 ? cpuCoreCount - 1 : 1;
}

Additionally you can use the following keys:

onPageRequest

onPageRequest: (request) => void

This function is triggered whenever the page sends a request, such as for a network resource.
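
As a sketch (assuming `request` is puppeteer's HTTPRequest, of which only the `url()` method is used here), a handler can record every resource url the page requests:

```typescript
// Minimal structural type for the part of puppeteer's HTTPRequest used here.
type RequestLike = { url(): string };

// Collect the url of every request the page makes.
const requestedUrls: string[] = [];

const onPageRequest = (request: RequestLike): void => {
    requestedUrls.push(request.url());
};

// Pass it in the configurations object:
// htmlDataScraper(urls, { onPageRequest });
```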

onPageLoadedForEachUrl

onPageLoadedForEachUrl: (puppeteerPage, currentUrl) => {}

  • puppeteerPage: Page A reference to the current puppeteer Page.
  • currentUrl: string The current url of the page.
  • returns any You can return whatever you need.

The returned value is set in PageResult.pageData for each url. If this function is not set, PageResult.pageData defaults to the result of page.content().
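
For example, here is a sketch of a handler that replaces the default page.content() value with a smaller object. It assumes the library awaits an async return, and uses only the `title()` method from puppeteer's Page:

```typescript
// Minimal structural type for the part of puppeteer's Page used here.
type PageLike = { title(): Promise<string> };

// The returned object becomes PageResult.pageData for currentUrl,
// instead of the full html from page.content().
const onPageLoadedForEachUrl = async (puppeteerPage: PageLike, currentUrl: string) => {
    const title = await puppeteerPage.title();
    return { url: currentUrl, title };
};

// Pass it in the configurations object:
// htmlDataScraper(urls, { onPageLoadedForEachUrl });
```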

onEvaluateForEachUrl

onEvaluateForEachUrl: {}

This key contains an object that registers functions. Those functions are passed to page.evaluate, and each function's return value is set under the corresponding name in PageResult.evaluates:

import htmlDataScraper, {PageResult} from 'html-data-scraper';
import {Browser} from 'puppeteer';

htmlDataScraper([
    'https://www.bbc.com',
], {
    onEvaluateForEachUrl: {
        title: (): string => {
            const titleElement: HTMLElement | null = document.getElementById('page-title');
            return titleElement ? titleElement.innerText : '';
        },
    },
})
.then(({results}: {results: PageResult[], browserInstance: Browser}) => {
    console.log(results[0]);
    // {
    //     pageData: "....",
    //     evaluates: {
    //         title: "..."
    //     }
    // }
});

onProgress

onProgress: (resultNumber, totalNumber, internalPageIndex) => {}

  • resultNumber: number The number of processed urls.
  • totalNumber: number The total number of urls to process.
  • internalPageIndex: number An index identifying the page that processes the urls.
  • return void No return needed.

This function runs each time a page finishes processing a url.

import htmlDataScraper from 'html-data-scraper';    

const progress: Record<number, string[]> = {};

htmlDataScraper([
    // ...
], {
    onProgress: (resultNumber: number, totalNumber: number, internalPageIndex: number) => {
        const status = resultNumber + '/' + totalNumber;
        if (progress[internalPageIndex]){
            progress[internalPageIndex].push(status);
        } else {
            progress[internalPageIndex] = [status];
        }
    },
})

PageResult

interface PageResult {
    pageData: any;
    evaluates: null | {
        [k: string]: any;
    };
}

pageData

This key contains the html content of the webpage. But if you set onPageLoadedForEachUrl, it contains the value returned by that function.

evaluates

This key contains the results of the onEvaluateForEachUrl functions.

Development

Setup

  1. Clone this repository
  2. yarn install

Run tests

yarn test

Author

👤 Ravidhu Dissanayake

Show your support

Give a ⭐️ if this project helped you!