@coya/web-scraper v0.2.2

Web Scraper

Web scraper on top of PhantomJS or Chromium.
If you choose to use PhantomJS, the module works as a client/server pair: a PhantomJS process runs as the web scraper server, while a client acts as a driver, sending scraping HTTP requests to it.
Chromium works differently: it is driven directly from Node.js.

Installation

npm install @coya/web-scraper

Build (for dev)

git clone https://github.com/Cooya/WebScraper
cd WebScraper
npm install # also installs the development dependencies
npm install phantomjs -g # if you need PhantomJS, install it globally
npm run build
npm run example # run the example script in the "examples" folder

Usage examples

The package lets you inject a JS function into the requested page:

const { ChromiumScraper } = require('@coya/web-scraper');

// if you want to use PhantomJS instead of Chromium
// const { PhantomScraper } = require('@coya/web-scraper');

const scraper = ChromiumScraper.getInstance();

const getLinks = function() { // return all links from the requested page
    return $('a').map(function(i, elt) {
        return $(elt).attr('href');
    }).get();
};

scraper.request({
    url: 'cooya.fr',
    fct: getLinks // function injected in the page environment
})
.then(function(result) {
    console.log(result); // returned value of the injected function
    scraper.close(); // end the client/server connection and kill the web scraper subprocess
}, function(error) {
    console.error(error);
    scraper.close();
});

Alternatively, the injected JS function can come from an external script:

const { ChromiumScraper } = require('@coya/web-scraper');

// if you want to use PhantomJS instead of Chromium
// const { PhantomScraper } = require('@coya/web-scraper');

const scraper = ChromiumScraper.getInstance();

scraper.request({
    url: 'cooya.fr',
    fct: __dirname + '/externalScript.js', // external script exporting the function to be injected
})
.then(function(result) {
    console.log(result); // returned value of the injected function
    scraper.close(); // end the client/server connection and kill the web scraper subprocess
}, function(error) {
    console.error(error);
    scraper.close();
});

externalScript.js:

module.exports = function() { // return all links from the requested page
    return $('a').map(function(i, elt) {
        return $(elt).attr('href');
    }).get();
};

Methods

ScraperClient.getInstance()

The scraper client object is a singleton: only one client can be created per process, so this method is required to obtain the instance (ChromiumScraper.getInstance() or PhantomScraper.getInstance()).
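
As a quick illustration (a minimal sketch, assuming the singleton behaviour described above), repeated calls return the same object:

const { ChromiumScraper } = require('@coya/web-scraper');

const first = ChromiumScraper.getInstance();
const second = ChromiumScraper.getInstance();
console.log(first === second); // true: only one client instance is ever created

first.close(); // release the scraper subprocess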

request(params)

Sends a request to a specific URL and injects JavaScript into the associated page. Returns a promise that resolves with the result of the injected function.

Parameter | Type | Description | Default value
--------- | ---- | ----------- | -------------
params | object | see below for details | none
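
Since request() returns a promise, it can also be consumed with async/await; a minimal sketch:

const { ChromiumScraper } = require('@coya/web-scraper');

const scraper = ChromiumScraper.getInstance();

(async () => {
    try {
        const title = await scraper.request({
            url: 'cooya.fr',
            fct: function() { return document.title; } // injected into the page
        });
        console.log(title);
    } catch (error) {
        console.error(error);
    } finally {
        scraper.close(); // always release the scraper subprocess
    }
})();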

close()

Terminates the underlying web scraper subprocess (PhantomJS or Chromium) so that the current Node.js script can exit properly.

Request parameters spec

Parameter | Type | Description | Required
--------- | ---- | ----------- | --------
url | string | target URL | yes
fct | function | JS function to inject into the page | yes
fct | string | path to the script and name of the function to inject, separated by a hash key (e.g. "path/to/script/script.js#functionToCall") | yes
referer | string | Referer header set on each request | optional
args | object | object passed to the injected function | optional
debug | boolean | enable debug mode (verbose) | optional
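
For illustration, here is a sketch combining the optional parameters. The selector forwarded via args, and the assumption that the injected function receives the args object as its argument, are illustrative, not documented behaviour:

const { ChromiumScraper } = require('@coya/web-scraper');

const scraper = ChromiumScraper.getInstance();

// return the href of every link matching the selector passed through "args"
// (assumption: the injected function receives the "args" object as its argument)
const getLinksBySelector = function(args) {
    return $(args.selector).map(function(i, elt) {
        return $(elt).attr('href');
    }).get();
};

scraper.request({
    url: 'cooya.fr',
    fct: getLinksBySelector,
    args: { selector: 'nav a' },        // forwarded to the injected function
    referer: 'https://www.google.com',  // Referer header sent with the request
    debug: true                         // verbose output
})
.then(function(result) {
    console.log(result);
    scraper.close();
}, function(error) {
    console.error(error);
    scraper.close();
});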