
Jason the Miner


Harvesting data at the <html> mine... Jason the Miner, a versatile Web scraper for Node.js.

⛏ Features

  • Composable: via a modular architecture based on pluggable processors. The output of one processor feeds the input of the next one. There are 3 types of processors:
    1. loaders: to fetch the data (via HTTP requests, by reading text files, etc.)
    2. parsers: to parse the data (HTML by default) & extract the relevant parts according to a predefined schema
    3. transformers: to transform and/or output the results (to a CSV file, via email, etc.)
  • Configurable: each processor can be chosen & configured independently
  • Extensible: you can register your own custom processors
  • CLI-friendly: Jason the Miner works well with pipes & redirections
  • Promise-based API
  • MIT-licensed

⛏ Installing

$ npm install -g jason-the-miner

⛏ Demos

Clone the project...

$ git clone https://github.com/mawrkus/jason-the-miner.git
$ cd jason-the-miner
$ npm install
$ npm run demos

...and have a look at the "demos" folder. Among them, you'll find examples of scraping:

  • Simple GitHub search (JSON, CSV, Markdown output)
  • Extended GitHub search with issues (including following links & paginating issues)
  • Goodreads books, following links to Amazon to grab their product IDs
  • Google search results, following them to find mobile apps
  • IMDb image gallery links (with pagination)
  • Mixcloud stats, templating them & sending them by email
  • Mixcloud SPA scraping with a headless browser
  • Avatar downloads
  • Bulk insertions into Elasticsearch from a CSV file
  • ...

⛏ Examples

CLI

Scraping the most popular JavaScript scrapers from GitHub:

// github-config.json
{
  "load": {
    "http": {
      "url": "https://github.com/search",
      "params": {
        "q": "scraper",
        "l": "JavaScript",
        "type": "Repositories",
        "s": "stars",
        "o": "desc"
      }
    }
  },
  "parse": {
    "html": {
      "repos": [".repo-list .repo-list-item h3 > a"]
    }
  },
  "transform": {
    "json-file": {
      "path": "./github-repos.json"
    }
  }
}

$ jason-the-miner -c github-config.json

Alternatively, with pipes & redirections:

// github-config.json
{
  "parse": {
    "html": {
      "repos": [".repo-list .repo-list-item h3 > a"]
    }
  }
}

$ curl "https://github.com/search?q=scraper&l=JavaScript&type=Repositories&s=stars&o=desc" | jason-the-miner -c github-config.json > github-repos.json

API

const JasonTheMiner = require('jason-the-miner');

const jason = new JasonTheMiner();

const load = {
  http: {
    url: "https://github.com/search",
    params: {
      q: "scraper",
      l: "JavaScript",
      type: "Repositories",
      s: "stars",
      o: "desc"
    }
  }
};

const parse = {
  html: {
    "repos": [".repo-list .repo-list-item h3 > a"]
  }
};

jason.harvest({ load, parse }).then(results => console.log(results));

⛏ The config file

{
  "load": {
    "[loader name]": {
      // loader options
    }
  },
  "parse": {
    "[parser name]": {
      // parser options
    }
  },
  "transform": {
    "[transformer name]": {
      // transformer options
    }
  }
}

Loaders

Jason the Miner comes with 5 built-in loaders:

| Name | Description | Options |
| --- | --- | --- |
| http | Uses axios as HTTP client | All axios request options + [_concurrency=1] (to limit the number of concurrent requests when following/paginating) & [_cache] (to cache responses on the file system) |
| browser | Uses puppeteer as browser | puppeteer launch, goto, screenshot, pdf and evaluate options |
| file | Reads the content of a file | path, [stream=false], [encoding="utf8"] & [_concurrency=1] (to limit the number of concurrent requests when paginating) |
| csv-file | Uses csv-parse to read a CSV file | All csv-parse options in a csv object + path + [encoding="utf8"] |
| stdin | Reads the content from the standard input | [encoding="utf8"] |

For example, an HTTP load config whose responses will be cached in the "tests/http-cache" folder:

...
"load": {
  "http": {
    "baseURL": "https://github.com",
    "url": "/search?l=JavaScript&o=desc&q=scraper&s=stars&type=Repositories",
    "_concurrency": 2,
    "_cache": {
      "_folder": "tests/http-cache"
    }
  }
}
...
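
For comparison, here is a file load config built from the options listed in the table above (the path is only an illustration; stream and encoding show their documented defaults):

...
"load": {
  "file": {
    "path": "./tests/fixtures/github-search.html",
    "stream": false,
    "encoding": "utf8"
  }
}
...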

Check the demos folder for more examples.

Parsers

Currently, Jason the Miner comes with 2 built-in parsers:

| Name | Description | Options |
| --- | --- | --- |
| html | Parses HTML, built with Cheerio | A parse schema |
| csv | Parses CSV, built with csv-parse | All csv-parse options |
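
For instance, a csv parse config simply forwards csv-parse options (a sketch; the exact nesting of the options may differ, check the demos folder):

...
"parse": {
  "csv": {
    "columns": true,
    "delimiter": ","
  }
}
...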

HTML schema definition

Examples
...
  "html": {
    // Single value
    "repo": ".repo-list .repo-list-item h3 > a"

    // Collection of values
    "repos": [".repo-list .repo-list-item h3 > a"]

    // Single object
    "repo": {
      "name": ".repo-list .repo-list-item h3 > a",
      "description": ".repo-list .repo-list-item div:first-child"
    }

    // Single object, providing a root selector _$
    "repo": {
      "_$": ".repo-list .repo-list-item",
      "name": "h3 > a",
      "description": "div:first-child"
    }

    // Collection of objects
    "repos": [{
      "_$": ".repo-list .repo-list-item",
      "name": "h3 > a",
      "description": "div:first-child"
    }]

    // Following
    "repos": [{
      "_$": ".repo-list .repo-list-item",
      "name": "h3 > a",
      "description": "div:first-child",
      "_follow": {
        "_link": "h3 > a",
        "stats": {
          "_$": ".pagehead-actions",
          "watchers": "li:nth-child(1) a.social-count",
          "stars": "li:nth-child(2) a.social-count",
          "forks": "li:nth-child(3) a.social-count"
        }
      }
    }]

    // Paginating
    "repos": [{
      "_$": ".repo-list .repo-list-item",
      "name": "h3 > a",
      "description": "div:first-child",
      "_paginate": {
        "_link": ".pagination > a[rel='next']",
        "_depth": 1
      }
    }]
  }
...

Full flavour

...
  "html": {
    "title": "title | trim",
    "metas": {
      "lang": "html < attr(lang)",
      "content-type": "meta[http-equiv='Content-Type'] < attr(content)"
    },
    "stylesheets": ["link[rel='stylesheet'] < attr(href)"],
    "repos": [{
      "_$": ".repo-list .repo-list-item ? text(crawler)",
      "_slice": "0,3",
      "name": "h3 > a",
      "last-update": "relative-time < attr(datetime)",
      "_follow": {
        "_link": "h3 > a",
        "description": "meta[property='og:description'] < attr(content) | trim",
        "url": "link[rel='canonical'] < attr(href)",
        "stats": {
          "_$": ".pagehead-actions",
          "watchers": "li:nth-child(1) a.social-count | trim",
          "stars": "li:nth-child(2) a.social-count | trim",
          "forks": "li:nth-child(3) a.social-count | trim"
        },
        "_follow": {
          "_link": ".js-repo-nav span[itemprop='itemListElement']:nth-child(2) > a",
          "open-issues": [{
            "_$": ".js-navigation-container li > div > div:nth-child(3)",
            "desc": "a:first-child | trim",
            "opened": "relative-time < attr(datetime)"
          }],
          "_paginate": {
            "_link": "a[rel='next']",
            "_slice": "0,1",
            "_depth": 2
          }
        }
      }
    }],
  }
...

As you can see, a schema is a plain object that recursively defines:

  • the names of the values/collection of values that you want to extract: "title" (single value), "metas" (object), "stylesheets" (collection of values), "repos" (collection of objects)
  • how to extract them: [selector] ? [matcher] < [extractor] | [filter] (check "Parse helpers" below)

Additional instructions can be passed to the parser:

  • _$ acts as a root selector: further parsing will happen in the context of the element identified by this selector
  • _slice limits the number of elements to parse, like String.prototype.slice(begin[, end])
  • _follow tells Jason to follow a single link (fetch new data) & to continue scraping after the new data is received
  • _paginate tells Jason to paginate (fetch & scrape new data) & to merge the new values into the current context; multiple links can be selected here to scrape several pages in parallel

Parse helpers

The following syntax specifies how to extract a value:

[property name]: [selector] ? [matcher] < [extractor] | [filter]

For instance:

...
"repos": [".repo-list-item h3 > a ? text(crawler) < attr(title) | trim"]
...

This will extract a "repos" array of values from the links identified by the ".repo-list-item h3 > a" selector, matching only those containing the text "crawler". The values will be retrieved from the "title" attribute of each link and trimmed.

Jason has 4 built-in element matchers:

  • text(regexString)
  • html(regexString)
  • attr(attributeName,regexString)
  • slice(begin,end)

They are used to test an element in order to decide whether to include it in or exclude it from parsing. If no matcher is specified, Jason includes every element.

7 built-in text extractors:

  • text([optionalStaticText]) (by default)
  • html()
  • attr(attributeName)
  • regex(regexString)
  • date(inputFormat,outputFormat) (parses a date with moment)
  • uuid() (generates a uuid v1 with uuid)
  • count() (counts the number of elements matching the selector, needs an array schema definition)

and 5 built-in text filters:

  • trim
  • single-space
  • lowercase
  • uppercase
  • json-parse (to parse JSON, like JSON-LD)
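
Putting a matcher, an extractor and a filter together (the selector below is only illustrative):

...
"html": {
  // keep only the links whose text contains "crawler", grab their "href" attribute & lowercase it
  "urls": [".repo-list-item h3 > a ? text(crawler) < attr(href) | lowercase"]
}
...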

Transformers

| Name | Description | Options |
| --- | --- | --- |
| stdout | Writes the results to stdout | [encoding="utf8"] |
| json-file | Writes the results to a JSON file | path & [encoding="utf8"] |
| csv-file | Writes the results to a CSV file using csv-stringify | csv: same as csv-stringify + path, [encoding='utf8'] and [append=false] (whether to append the results to an existing file or not) |
| download-file | Downloads files to a given folder using axios | [baseURL], [parseKey], [folder='.'], [namePattern='{name}'], [maxSizeInMb=1] & [concurrency=1] |
| email | Sends the results by email using nodemailer | Same as nodemailer, split between the smtp and message options |
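
As an illustration, a download-file transform config might look like this (a sketch built only from the options in the table above; the baseURL, parseKey and folder values are hypothetical):

...
"transform": {
  "download-file": {
    "baseURL": "https://github.com",
    "parseKey": "avatars",
    "folder": "./downloads",
    "namePattern": "{name}",
    "maxSizeInMb": 1,
    "concurrency": 2
  }
}
...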

Jason supports a single transformer or an array of transformers:

{
  ...
  "transform": [{
    "json-file": {
      "path": "./github-repos.json"
    }
  }, {
    "csv-file": {
      "path": "./github-repos.csv"
    }
  }]
}

⛏ Bulk processing

Parameters can be defined in a CSV file and applied to configure the processors:

{
  "bulk": {
    "csv-file": {
      "path": "./github-search-queries.csv",
      "csv": {
        "columns": true,
        "delimiter": ","
      }
    }
  },
  "load": {
    "http": {
      "baseURL": "https://github.com",
      "url": "/search?l={language}&o=desc&q={query}&s=stars&type=Repositories",
      "_concurrency": 2
    }
  },
  "parse": {
    "html": {
      "title": "< text(Best {language} repos)",
      "repos": [".repo-list .repo-list-item h3 > a"]
    }
  },
  "transform": {
    "json-file": {
      "path": "./github-repos-{language}.json"
    }
  }
}

github-search-queries.csv:

language,query
JavaScript,scraper
Python,scraper

⛏ API

constructor({ fallbacks = {} } = {})

fallbacks defines which processor to use when not explicitly configured (or missing in the config file):

  • load: 'identity',
  • parse: 'identity',
  • transform: 'identity',
  • bulk: null

The fallbacks change when using the CLI (see bin/jason-the-miner.js):

  • load: 'stdin',
  • parse: 'html',
  • transform: 'stdout',
  • bulk: null
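
For example, to make the API behave like the CLI by default, you could pass the same fallbacks to the constructor (a sketch based on the signature above):

const JasonTheMiner = require('jason-the-miner');

// use the same defaults as the CLI (see the list above)
const jason = new JasonTheMiner({
  fallbacks: {
    load: 'stdin',
    parse: 'html',
    transform: 'stdout'
  }
});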

loadConfig(configFile)

Loads a config from a JSON or JS file.

jason.loadConfig('./harvest-me.json');

harvest({ bulk, load, parse, transform } = {})

Launches the harvesting process:

jason
  .loadConfig('./config.json')
  .then(() => jason.harvest())
  .catch(error => console.error(error));

You can pass custom options to temporarily override the current config:

jason
  .loadConfig('./config.json')
  .then(() => jason.harvest({
    load: {
      http: {
        url: "https://github.com/search?q=scraper&l=Python&type=Repositories"
      }
    }
  }))
  .catch(error => console.error(error));

To permanently override the current config, you can modify Jason's config property:

const allResults = [];

jason
  .loadConfig('./harvest-me.json')
  .then(() => jason.harvest())
  .then((results) => {
    allResults.push(results);

    jason.config.load.http.url = 'https://github.com/search?q=scraper&l=Python&type=Repositories';

    return jason.harvest();
  })
  .then((results) => {
    allResults.push(results);
  })
  .catch(error => console.error(error));

registerHelper({ category, name, helper })

Registers a parse helper in one of the 3 categories: match, extract or filter. helper must be a function.

const url = require('url');

jason.registerHelper({
  category: 'filter',
  name: 'remove-query-params',
  helper: (href = '') => {
    if (!href || href === '#') {
      return href;
    }

    const { protocol, host, pathname } = url.parse(href);

    return `${protocol}//${host}${pathname}`;
  }
});
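
Once registered, the custom filter can be used like any built-in one in a parse schema (the selector is only illustrative):

...
"html": {
  "urls": ["h3 > a < attr(href) | remove-query-params"]
}
...
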
registerProcessor({ category, name, processor })

Registers a new processor in one of the 3 categories: load, parse or transform. processor must be a class implementing the run() method:

jason.registerProcessor({
  category: 'transform',
  name: 'template',
  processor: Templater
});

class Templater {
  constructor({ config }) {
    // automatically receives its config
  }

  /**
   * @param {*} results
   * @return {Promise.<*>}
   */
  run({ results }) {
    // must be implemented & must return a promise.
  }
}

jason.config.transform = {
  template: {
    "templatePath": "my-template.tpl",
    "outputPath": "my-page.html"
  }
};

Be aware that loaders must also implement the getConfig() and buildLoadOptions({ link }) methods. Have a look at the source code for more info.
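
As a rough sketch of a custom loader (the method names come from the paragraph above; the return values are assumptions, so check the built-in loaders in the source code for the exact contracts):

class MyFixtureLoader {
  constructor({ config }) {
    // like any processor, automatically receives its config
    this._config = config;
  }

  /**
   * @return {Promise.<string>} the raw data to be parsed (assumption: an HTML string)
   */
  run() {
    return Promise.resolve('<html><body><h3><a href="/repo">repo</a></h3></body></html>');
  }

  /**
   * @return {Object} the current load config (assumption: used when following/paginating)
   */
  getConfig() {
    return this._config;
  }

  /**
   * @param {string} link
   * @return {Object} the load options to use for a followed/paginated link (assumption)
   */
  buildLoadOptions({ link }) {
    return Object.assign({}, this._config, { url: link });
  }
}

jason.registerProcessor({
  category: 'load',
  name: 'fixture',
  processor: MyFixtureLoader
});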

⛏ Testing

$ git clone https://github.com/mawrkus/jason-the-miner.git
$ cd jason-the-miner
$ npm install
$ npm run test

⛏ Resources

  • Web Scraping With Node.js: https://www.smashingmagazine.com/2015/04/web-scraping-with-nodejs/
  • X-ray, The next web scraper. See through the noise: https://github.com/lapwinglabs/x-ray
  • Simple, lightweight & expressive web scraping with Node.js: https://github.com/eeshi/node-scrapy
  • Node.js Scraping Libraries: http://blog.webkid.io/nodejs-scraping-libraries/
  • https://www.scrapesentry.com/scraping-wiki/web-scraping-legal-or-illegal/
  • http://blog.icreon.us/web-scraping-and-you-a-legal-primer-for-one-of-its-most-useful-tools/
  • Web scraping o rastreo de webs y legalidad: https://www.youtube.com/watch?v=EJzugD0l0Bw
  • Scraper API blog: https://www.scraperapi.com/blog/

⛏ A final note...

Please take these guidelines into consideration when scraping:

  • The content being scraped is not copyright protected.
  • The act of scraping does not burden the services of the site being scraped.
  • The scraper does not violate the Terms of Use of the site being scraped.
  • The scraper does not gather sensitive user information.
  • The scraped content adheres to fair use standards.