# node-crawler

Crawls web URLs from a list.

A very simple wrapper for Puppeteer, with the most basic requirements for a crawler included.
## Install

```shell
npm install @robertsvendsen/node-crawler
```
If you get an error like this:

```
Error: Failed to launch the browser process! undefined
Fontconfig error: No writable cache directories
```

see the Puppeteer troubleshooting guide: https://pptr.dev/troubleshooting#could-not-find-expected-browser-locally

If you fixed the problem by setting an environment variable during install, you must also keep that variable set at runtime: `PUPPETEER_CACHE_DIR=$(pwd)`.
## Example

```js
import Crawler, { CrawlerOptions, CrawlerPageOptions } from '@robertsvendsen/node-crawler/src/crawler';

const options = new CrawlerOptions({
  name: 'node-crawler-agent',
  concurrency: 1,
  readRobotsTxt: true,
  dataPath: 'data/crawler',
});

const crawler = new Crawler(options);
const links = [{ url: 'https://www.google.com' }];

init().then(async () => {
  console.info('Crawling complete');
  // await delay(10000); // If the script exits before crawling has completed, add a delay here; the queue can be empty while crawling is still in progress.
  await crawler.close();
  process.exit();
});

async function init() {
  const pageOptions = new CrawlerPageOptions({ downloadImages: true });

  for (const link of links) {
    crawler.add(link.url, pageOptions).then((result) => {
      if (result) {
        console.info('Crawled', link.url);
      }
    });

    // To avoid saturating the CPU right at startup, don't fill the queue all the way.
    await crawler.queue.onSizeLessThan(options.concurrency * 2);
  }

  await crawler.queue.onEmpty();
}
```
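The `delay` helper referenced in the comment above is not part of the package; a minimal sketch:

```js
// Minimal delay helper: resolves after the given number of milliseconds.
function delay(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}
```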
## Options

### CrawlerBrowserOptions

```js
width = 1920; // e.g. 3840 for 4K
height = 1080; // e.g. 2160 for 4K
isLandscape = false;
isMobile = false;
hasTouch = false;
```
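Assuming `CrawlerBrowserOptions` is exported from the same module and accepts a partial options object like the other option classes (an assumption; the example above only shows `CrawlerOptions` and `CrawlerPageOptions`), overriding the viewport might look like this:

```js
import { CrawlerBrowserOptions } from '@robertsvendsen/node-crawler/src/crawler'; // Assumed export location.

// 4K desktop viewport; fields you don't set keep the defaults listed above.
const browserOptions = new CrawlerBrowserOptions({
  width: 3840,
  height: 2160,
});
```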
### CrawlerPageOptions

```js
downloadImages = false;
returnPageInstance = false; // If true, you must close the page yourself.
timeout = 10000; // Page load timeout in ms.
waitUntil = 'networkidle2';
```
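For slow-loading sites you can raise the timeout and wait for full network idle; `networkidle0` is a standard Puppeteer `waitUntil` value. A sketch using only the fields listed above:

```js
// Per-page options for slow-loading sites.
const patientPageOptions = new CrawlerPageOptions({
  downloadImages: true,
  timeout: 30000,            // Allow up to 30 s for the page to load.
  waitUntil: 'networkidle0', // Stricter than the default 'networkidle2'.
});
```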
### CrawlerOptions

```js
concurrency = 1;
readRobotsTxt = true;
name = 'node-crawler'; // Just the name; no version or other info.
version = '0.1';
email = ''; // Contact email for this crawler.
dataPath = 'data';
saveAsPDF = false; // Enable PDF generation (printing) of the site.
saveFiles = true; // Set to false if you want to handle saving files yourself.
headless = true;
```
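Putting the crawler-level options together, a polite, identifiable configuration might look like the sketch below (the contact email is presumably exposed to crawled sites, for example via the user agent; that use is an assumption):

```js
// A polite, identifiable crawler configuration using only the options listed above.
const politeOptions = new CrawlerOptions({
  name: 'my-crawler',           // Name only; the version is a separate field.
  version: '1.0',
  email: 'crawler@example.com', // Contact address (assumed to be exposed to crawled sites).
  concurrency: 2,
  readRobotsTxt: true,
  dataPath: 'data/crawler',
  saveAsPDF: true,              // Also print each crawled page to PDF.
  headless: true,
});
```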
## Roadmap

- [x] Concurrency using threads or processes. It may be enough to simply increase the concurrency option, since Puppeteer can handle multiple tabs.
- [ ] Recursive crawling options
- [ ] When crawling recursively, honor the robots.txt crawl delay as well.
- [ ] Callback function in options to determine whether a link should be queued (for recursive crawling)
- Own database (sqlite3)
  - [ ] table: sites (site_id, domain, url)
  - [ ] table: site_options
  - [ ] table: pages (page_id, site_id, path, querystring, last_visited, status_code, redirect_location)
- Logo fetcher (upper left corner, name contains 'logo'?)