smeagol

v0.2.0

Published

2 years ago

An easy-to-use Node.js http/https web crawler.

gserrano

crawler scrapper

Smeagol

Smeagol is a very simple Node.js crawler module that lets you define URL patterns to extract different content from different pages.

Install smeagol

npm install smeagol

How to use

Require Smeagol

var Smeagol = require('smeagol');

Instance and settings

let smeagol = new Smeagol(
    {
        crawl : [
            {
                // Pattern selecting the pages to scrape
                pattern_url : '^http://g1.globo.com/economia/noticia/(.*)?$',
                // Name of the result group in Smeagol's results
                id : 'news',
                // CSS selector for each item to extract from the page
                each_item : '#glb-materia',
                // Label -> expression evaluated for each item
                find : {
                    id    : '$(".share-bar").attr("data-url")',
                    title : '$(".entry-title").text()'
                }
            }
        ],
        limit: 6,
        continuous : true,
        maxConcurrency: 6,
        domain : 'http://g1.globo.com/',
        pattern_to_crawl : '^http://g1.globo.com/economia/noticia/(.*)?$'
    }
);

"pattern_url" defines which pages Smeagol will scrape. "id" identifies the result group in Smeagol's results. "each_item" is a CSS selector: Smeagol iterates over the elements it matches on the page and extracts the data defined in "find" from each one. "find" is an object that maps a label to a selector expression (a jQuery-style string, as in the example above) for each piece of information you want from each "each_item".
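To check which pages a "pattern_url" value would select before starting a crawl, the pattern can be tried against a few URLs by hand. The sketch below assumes Smeagol treats the pattern string as a regular expression, which the README does not state explicitly. The helper name `matchesPattern` and the sample URLs are made up for illustration.

```javascript
// Assumption: the pattern string is compiled as a regular expression.
// This is not confirmed by the README; it is inferred from the pattern syntax.
const patternUrl = '^http://g1.globo.com/economia/noticia/(.*)?$';
const matcher = new RegExp(patternUrl);

// Hypothetical helper: true when the URL would be selected by the pattern.
function matchesPattern(url) {
    return matcher.test(url);
}

// Made-up sample URLs, used only to exercise the pattern.
const samples = [
    'http://g1.globo.com/economia/noticia/2017/01/example.html',
    'http://g1.globo.com/economia/',
    'http://g1xglobo.com/economia/noticia/example.html'
];

for (const url of samples) {
    console.log(matchesPattern(url), url);
}

// An unescaped "." matches any character, so the third URL above also matches.
// Escaping the dots gives a stricter variant of the same pattern.
const strictMatcher = new RegExp('^http://g1\\.globo\\.com/economia/noticia/(.*)?$');
console.log(strictMatcher.test(samples[2]));
```

If stricter matching is wanted, the escaped variant can be used as the "pattern_url" value instead, under the same assumption about how the string is interpreted.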

Crawl

Just start crawling!

smeagol.crawl({
    uri : 'http://g1.globo.com/economia/'
})

Events

Smeagol uses Node.js events to let you decide what to do with the information you scrape.

complete(results)

Emitted when Smeagol finishes scraping, or when it has scraped the number of pages set by limit in the settings.

smeagol.on('complete', function(results){
    console.log(results);
    console.log('Finished');
})

crawl(url, result)

Emitted for every item (each_item in the settings) that Smeagol scrapes.

result is a JSON object. url is the URL of the page where Smeagol scraped the result.

smeagol.on('crawl', function(url, result){
    console.log('crawl', url, result);
})
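One way to combine the two events is to collect items as they arrive and summarise them when the crawl ends. The sketch below is not Smeagol itself: it uses a plain Node.js EventEmitter as a stand-in (`fakeCrawler`) so it runs without the package. The emitted URLs and result objects are invented, and the shape of result (here, the id and title labels from the "find" example) is an assumption.

```javascript
const { EventEmitter } = require('events');

// Stand-in for a Smeagol instance. It only mimics the 'crawl' and
// 'complete' event names and argument order described in this README.
const fakeCrawler = new EventEmitter();

// Items grouped by the page URL they were scraped from.
const itemsByUrl = new Map();
let finished = false;

fakeCrawler.on('crawl', function (url, result) {
    if (!itemsByUrl.has(url)) {
        itemsByUrl.set(url, []);
    }
    itemsByUrl.get(url).push(result);
});

fakeCrawler.on('complete', function (results) {
    finished = true;
    console.log('Finished, pages with items:', itemsByUrl.size);
});

// Simulated events with made-up data, in place of a real crawl.
fakeCrawler.emit('crawl', 'http://g1.globo.com/economia/noticia/a.html', { id: 'a', title: 'First' });
fakeCrawler.emit('crawl', 'http://g1.globo.com/economia/noticia/b.html', { id: 'b', title: 'Second' });
fakeCrawler.emit('crawl', 'http://g1.globo.com/economia/noticia/a.html', { id: 'a2', title: 'Third' });
fakeCrawler.emit('complete', {});
```

With a real instance, the same two handlers would be registered on smeagol before calling smeagol.crawl(...).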
