web-tree-crawl
v1.1.4
Simple framework for crawling/scraping web sites. The result is a tree, where each node is a single request.
Readme
Note to English speakers
Many comments, issues, etc. are partially written in German. If you want something translated, create an issue and I'll take care of it.
Introduction
web-tree-crawl is available on npm and on GitLab.
Idea
The crawling process is tree-shaped: you start with a single URL (the root), download a document (a node) and discover new URLs (child nodes), which in turn will be downloaded. So every crawled document is a node in the tree and every URL is an edge. The tree spans only new edges; edges to already known URLs will be stored, but not processed.
The end result is a tree representing the crawl process. All discovered information is stored in this tree.
The main difference between crawlers is which URLs and which data are scraped from the discovered documents. So those two scrapers need to be supplied by the user, while the library web-tree-crawl takes care of everything else.
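To give a rough idea of the result, here is a sketch of what a node in the tree conceptually holds. The property names are only illustrative assumptions, not the actual node format of web-tree-crawl; see the examples in the repository for the real structure.
// illustrative sketch only -- not the actual node format of web-tree-crawl
let exampleNode = {
    url: "https://xkcd.com",   // the URL (edge) that led to this node
    data: {},                  // whatever your dataScraper returned for this document
    children: []               // one node per newly discovered URL
};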
Example
Note: all examples use ES6 (ECMAScript 2015).
So let's say you want the last couple of comics from xkcd.com. All you have to do is:
"use strict";
const crawler = require('web-tree-crawl');
// configure your crawler
let ts = new crawler("https://xkcd.com");
ts.config.maxRequests = 5;
ts.config.dataScraper = crawler.builtin.dataScraper.generalHtml;
ts.config.urlScraper = crawler.builtin.urlScraper.selectorFactory('a[rel="prev"]');
// execute!
ts.buildTree(function (root) {
    // print discovered data to stdout
    console.log(JSON.stringify(crawler.builtin.treeHelper.getDataAsFlatArray(root), null, "\t"));
});
For more examples see: https://gitlab.com/wotanii/web-tree-crawl/tree/master/examples
Details/Documentation
Use web-tree-crawl like this:
- create the object & set the initial URL
- modify the config-object
- call buildTree & wait for the callback
Config
You will always want to define these config attributes:
- maxRequests: how many documents may be crawled?
- dataScraper: what data do you want to find?
- urlScraper: how does the crawler look for new urls?
There are more, but their defaults work well on most websites and are pretty much self-explanatory (if not, let me know by opening an issue).
Url Scraper
These are functions that scrape URLs from a document. The crawler will apply this function to every crawled document to discover new documents.
Create your own url scraper or use a builtin one. All url scrapers must have this signature (see the sketch below the list):
- parameters
- string: content of current document
- string: url of current document
- returns
- string[]: discovered urls
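For example, a hand-rolled url scraper with this signature might look like the following sketch. The function name and the regex-based extraction are my own illustration; the builtin selectorFactory is usually the better choice.
"use strict";
const url = require('url');

// extracts all href attributes and resolves them against the current document's URL
function myUrlScraper(content, documentUrl) {
    let found = [];
    let regex = /href="([^"]+)"/g;
    let match;
    while ((match = regex.exec(content)) !== null) {
        found.push(url.resolve(documentUrl, match[1]));
    }
    return found;
}

// use it like any builtin:
// ts.config.urlScraper = myUrlScraper;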
Data Scraper
These are functions that scrape data from a document. The crawler will apply this function to every crawled document to decide what data to store for that document.
The crawler will not use this data in any way, so you can return whatever you want.
Create your own data scraper or use a builtin one. All data scrapers must have this signature (see the sketch below the list):
- parameters
- string: content of current document
- string: current node
- returns
- anything
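As a sketch, a custom data scraper with this signature might simply pull the document title. The function name and the scraped fields are only an example, not part of the library.
// returns the document's <title> and its length; the return value is stored as-is
function myDataScraper(content, node) {
    let match = /<title>([^<]*)<\/title>/i.exec(content);
    return {
        title: match ? match[1] : null,
        length: content.length
    };
}

// ts.config.dataScraper = myDataScraper;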
Builtin
There are some static builtin functions that you don't need to use, but they will make your life easier. Some of these functions can be used directly, and some are factories that return such functions.
Url Scraper
These are functions that scrape for URLs in a common way. Use them by putting them in your config like this:
ts.config.urlScraper = treeCrawler.builtin.urlScraper.selectorFactory('a[rel="prev"]');
Data Scraper
These are functions that scrape for data in a common way. Use them by putting them in your config like this:
ts.config.dataScraper = treeCrawler.builtin.dataScraper.generalHtml;
Tree Helper
These are functions that help extract information from the result tree. Use them once buildTree has finished.
They will either modify your tree (e.g. treeCrawler.builtin.treeHelper.addParentsToNodes) or extract data from your tree (e.g. crawler.builtin.treeHelper.getDataAsFlatArray).
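For instance, both helpers could be combined in the buildTree callback like this. This is a sketch; I'm assuming addParentsToNodes takes the root node, analogous to getDataAsFlatArray.
ts.buildTree(function (root) {
    // add a parent reference to every node (modifies the tree in place)
    crawler.builtin.treeHelper.addParentsToNodes(root);
    // collect the scraped data of all nodes into a single flat array
    let allData = crawler.builtin.treeHelper.getDataAsFlatArray(root);
    console.log(allData.length + " documents crawled");
});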
Dev-Setup
sudo apt install npm nodejs
git clone [email protected]:wotanii/web-tree-crawl.git
cd web-tree-crawl/
npm install
npm test
If the tests fail with your setup, either create an issue or comment on an existing one.