limador
Powerful Scraping and Crawling library with anti-scraping, scalability, storage, static/dynamic contents, monitoring UI and more. Ready to deploy on cloud instances or serverless.
NOTE: THIS IS WORK IN PROGRESS. YOU CANNOT USE IT YET.
Motivation
There are plenty of scraping frameworks, but the ones we found were:
- too simple (thin wrappers around Cheerio, Puppeteer, Playwright and the like)
- too complex (over-engineered)
- too closed (requiring a subscription to unleash all potential)
- unscalable
- difficult to deploy to the cloud
- difficult to operate once deployed
Why Limador?
- simple for basic scraping => You can read the docs, try it and see results in less than an hour.
- powerful for complex projects => including those that need scraping triggers, cron jobs, conditional/recursive crawling, multiple scrapers for different kinds of content, etc.
- ready to be deployed to the cloud => either on a single cloud instance or serverless.
- horizontally scalable => with serverless cloud functions or autoscaling pools of cloud instances.
- able to deal with anti-scraping => multiple rotating proxies, sticky sessions, human-like HTTP requests (headers, cookies...), etc. (see the sketch after this list)
- monitoring UI => track job progress, logs, results, etc.
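The anti-scraping options are still work in progress, so as an illustration only, a configuration could look roughly like this. Every option name below is an assumption for the sketch, not the final Limador API:

// Hypothetical sketch only: the option names below are illustrative
// assumptions, not a documented Limador API.
const antiScrapingConfig = {
  proxies: ['http://proxy1:8080', 'http://proxy2:8080'], // rotating proxy pool
  stickySessions: true, // reuse the same proxy and cookies within a session
  rateLimit: { requestsPerMinute: 30 }, // throttle requests per target site
  humanLike: true // realistic headers, cookies and timing between requests
}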
Getting Started
Creating a basic scraping project is as easy as:
cd repo-folder
npm init
npm install limador puppeteer
Let's scrape the title of the Google homepage using Limador. Create a file called index.js and add these contents:
const { Limador } = require('limador')

async function main() {
  // in-memory queue and database: good defaults for local development
  const limador = await Limador.init({
    queue: { type: 'memory' },
    database: { type: 'sqlite-memory' },
    batches: {
      'scrape-google-title': {
        title: 'Scrape Google Title',
        // each batch produces the list of jobs to run
        jobs: (params) => [{
          url: 'https://google.com',
          tool: 'puppeteer',
          call: 'pageScraper',
        }]
      }
    },
    scrapers: {
      // receives a Puppeteer page already navigated to the job URL
      'pageScraper': async ({ page }) => {
        console.log(await page.title())
      }
    }
  })

  // start the batch, wait for its jobs to finish, then shut down
  const batch = await limador.start('scrape-google-title')
  await batch.done()
  await limador.stop()
}

main()
Run it with:
> node index.js
Limador is running...
See progress, logs and scraped data at http://localhost:4300
Full Example
The following is a complete scraper using all the main features:
- anti-scraping: proxies, rate limiting, human-like browsing
- development and production environments
- horizontal scaling
- progress UI
- storage of results in a database
- storage of screenshots in a bucket
- recursive crawling
index.js:
const { Limador, DataTypes } = require('limador')

const scrapers = {
  'pageScraper': async ({ job, batch, db, page }) => {
    // save a screenshot of the page
    const title = await page.title()
    await page.screenshot({ path: title + '.png' })

    // store the needed data in the database (DOM nodes are not
    // serializable, so map them to plain objects inside evaluate)
    const data = await page.evaluate(() => {
      return Array.from(document.querySelectorAll('#text-field'))
        .map((el) => ({ name: el.textContent }))
    })
    for (const elem of data)
      await db.insert('Foo', elem)

    // collect the links that need to be crawled next
    const links = await page.evaluate(() => {
      return Array.from(document.querySelectorAll('a')).map((a) => a.href)
    })

    // enqueue one job per link, reusing this job's config and cookies
    const cookies = await page.cookies()
    const jobs = links.map((link) => ({
      ...job.config,
      url: link,
      cookies,
      cron: null
    }))
    await batch.queue.addJobs(jobs)
  }
}
// use SQS in production and an in-memory queue in development
const queue = process.env.NODE_ENV === 'production' ? {
  type: 'sqs',
  accessKeyId: 'your AWS access key id',
  secretAccessKey: 'your AWS secret access key',
  name: 'queue name'
} : { type: 'memory' }
// slaves only consume jobs from the shared queue
async function slave() {
  const limador = new Limador({
    slave: true,
    queue,
    scrapers,
    database: { type: 'sqlite-memory' }
  })
  await limador.init()
}
// the master exposes the API/UI, schedules batches and also runs jobs
async function master() {
  const limador = await Limador.init({
    maxcpu: 50, // resource limits that trigger the onLimits callback
    maxmem: 50,
    onLimits: () => { console.log('Limits reached') },
    api: { port: 5000 },
    queue,
    scrapers,
    database: { type: 'sqlite-memory' },
    batches: {
      'batch-name': {
        title: 'batch title',
        // parameters editable from the monitoring UI
        params: [
          { name: 'cron', title: 'Periodicity', type: 'CRON', default: '0 3 * * *' },
          { name: 'url', title: 'URL', type: 'STRING', default: 'https://google.es' }
        ],
        jobs: (params) => [{
          url: params.url.value,
          tool: 'puppeteer',
          call: 'pageScraper',
        }]
      }
    }
  })

  // define the model used by db.insert('Foo', ...) in the scraper
  await limador.db.defineModel('Foo', {
    name: DataTypes.TEXT
  })

  const batch = await limador.start('batch-name')
  await batch.done()
  await limador.stop()
}
// run as master when MASTER is set or outside production; otherwise run as a slave
if (process.env.MASTER || process.env.NODE_ENV !== 'production')
  master()
else
  slave()
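In production, the same script is launched in both roles; for example, assuming the MASTER environment variable convention above:
> NODE_ENV=production MASTER=1 node index.js
> NODE_ENV=production node index.js
The first command starts the single master process (API, UI and scheduling); the second can be repeated on as many instances as needed to add slaves that consume jobs from the shared queue.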