declarative-scraper

v0.1.1

Published

3 years ago

Simple & Human-Friendly HTML Scraper with JSON-LD support

gaetanlegac

scraper spider crawler crawling scraping web html extract extractor jsonld data

Dopamyn Scraper

Simple & Human-Friendly HTML Scraper with Proxy Rotator.

/!\ WARNING: This package is not yet mature enough to be used in production.

Installation

npm install --save declarative-scraper

Usage Example

// Import dependencies
import Scraper, { gotAdapter, Action } from 'declarative-scraper';
import got from 'got';

// Configure your scraper
const scraper = new Scraper({

    // Use the got package to make our http requests
    adapter: gotAdapter(got),
    // Show debug information
    debug: true,
    // If an error occurs while extracting an item's data, exclude that item from the results
    onItemError: Action.EXCLUDE,

});

// Scrape Cryptocurrencies list
const results = await scraper.scrape({

    // 1. Basic options
    id: 'cryptocurrencies', // Identifier for debugging
    url: 'https://coinmarketcap.com/', // URL address to scrape

    // 2. Extraction
    items: $ => $('table.cmc-table > tbody > tr'), // Items to iterate
    extract: ($) => ({ // Data to extract for each item

        logo: $('> td:eq(2) img.coin-logo').attr('src'),

        // The current item will be excluded from results if the name can't be extracted
        name: $('> td:eq(2) p[font-weight="semibold"]').text()?.trim() || Action.EXCLUDE,

        price: $('> td:eq(3)').text()

    }),

    // 3. Processing
    required: ['name', 'price'], // If name or price cannot be extracted, an error will be thrown
    process: async ({ logo, name, price }) => ({ // Normalize / Format the extracted data

        logo,

        name: name.trim(),

        price: parseFloat( price.trim().replace(/[^\d\.]/g, '') )

    }),

})

Output:

[
    {
        "logo": "https://s2.coinmarketcap.com/static/img/coins/64x64/1.png",
        "name": "Bitcoin",
        "price": 48415.71
    },
    {
        "logo": "https://s2.coinmarketcap.com/static/img/coins/64x64/1027.png",
        "name": "Ethereum",
        "price": 3634.48
    },
    {
        "logo": "https://s2.coinmarketcap.com/static/img/coins/64x64/2010.png",
        "name": "Cardano",
        "price": 2.49
    },
    {
        "logo": "https://s2.coinmarketcap.com/static/img/coins/64x64/1839.png",
        "name": "Binance Coin",
        "price": 429.91
    },
    {
        "logo": "https://s2.coinmarketcap.com/static/img/coins/64x64/825.png",
        "name": "Tether",
        "price": 1
    },
    {
        "logo": "https://s2.coinmarketcap.com/static/img/coins/64x64/52.png",
        "name": "XRP",
        "price": 1.12
    },
    {
        "logo": "https://s2.coinmarketcap.com/static/img/coins/64x64/5426.png",
        "name": "Solana",
        "price": 161.09
    },
    {
        "logo": "https://s2.coinmarketcap.com/static/img/coins/64x64/6636.png",
        "name": "Polkadot",
        "price": 35.9
    },
    {
        "logo": "https://s2.coinmarketcap.com/static/img/coins/64x64/74.png",
        "name": "Dogecoin",
        "price": 0.2461
    },
    {
        "logo": "https://s2.coinmarketcap.com/static/img/coins/64x64/3408.png",
        "name": "USD Coin",
        "price": 1
    }
]
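
The numeric prices in the output come from the clean-up done in the process step. As a minimal standalone sketch of that step (the helper name normalizePrice is hypothetical and not part of the package), the same expression can be tried in isolation:

```typescript
// Hypothetical helper mirroring the price clean-up in the `process` callback above.
// It drops every character that is not a digit or a dot, then parses the rest.
function normalizePrice(raw: string): number {
    return parseFloat(raw.trim().replace(/[^\d.]/g, ''));
}

console.log(normalizePrice(' $48,415.71 ')); // 48415.71
console.log(normalizePrice('$0.2461'));      // 0.2461
console.log(normalizePrice('n/a'));          // NaN (nothing numeric left to parse)
```

This assumes a dot as the decimal separator and commas only as thousands separators; a price formatted as "48.415,71" would be parsed incorrectly, so the result may need an extra NaN or locale check.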

Proxy Rotator

A proxy can be useful if the website you want to scrape has protections against automated traffic. Since most scraping proxies limit the number of requests, the included proxy rotator switches to another proxy once the limit on the current one has been reached.

import Scraper, { ProxyRotator } from 'declarative-scraper';
import got from 'got';

const scraper = new Scraper({
    ...
    proxy: new ProxyRotator({
        zenscrape: {
            prefix: 'https://app.zenscrape.com/api/v1/get?apikey=<key>&url=',
            getRemaining: () => got('https://app.zenscrape.com/api/v1/status?apikey=<key>', {
                responseType: 'json'
            }).then(res => {
                console.log(`[proxy][getRemaining] zenscrape`, res.body);
                return res.body['remaining_requests'] as number;
            })
        },
        ...
    })
});
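
To make the rotation idea concrete, here is a minimal sketch of the selection logic. It is not the package's actual implementation: the ProxyConfig type and the selectProxy, buildProxiedUrl and pickProxy functions are hypothetical names, modelled only on the prefix and getRemaining options shown above. Appending the URL-encoded target to the prefix is also an assumption.

```typescript
// Hypothetical shape, modelled on the `zenscrape` entry above.
type ProxyConfig = {
    prefix: string;                      // URL prefix the target URL is appended to
    getRemaining: () => Promise<number>; // number of requests left on this proxy
};

// Return the name of the first proxy that still has requests left, if any.
function selectProxy(remaining: Record<string, number>): string | undefined {
    return Object.keys(remaining).find(name => remaining[name] > 0);
}

// Build the proxied URL. Encoding the target URL is an assumption of this sketch.
function buildProxiedUrl(prefix: string, targetUrl: string): string {
    return prefix + encodeURIComponent(targetUrl);
}

// Ask each proxy for its remaining quota, then pick the first usable one.
async function pickProxy(
    proxies: Record<string, ProxyConfig>
): Promise<string | undefined> {
    const remaining: Record<string, number> = {};
    for (const name of Object.keys(proxies)) {
        remaining[name] = await proxies[name].getRemaining();
    }
    return selectProxy(remaining);
}
```

With quotas of { a: 0, b: 5 }, selectProxy returns 'b'; when every quota is exhausted it returns undefined, and the caller has to decide whether to wait or fail.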

TODO

Better doc
Strict type checking
Fix typings for extracted data
Tests
