web-crawl
v0.1.0-beta.5
Utility for crawling websites
This is a simple web scraping helper module that I threw together to help me set up a web crawler whenever I need one.
## Setup/configuration
The web scraper just needs a config object with these keys:

name
: The name of the crawler

type
: The type of crawler (currently just WebCrawler, adding more types as time goes on)

params
: Parameters for the start point. Passed to the request-promise library as its params

delay
: How many seconds to wait between page hops

settings
: How to crawl the website. Currently only RegExp objects are used; both individual RegExp objects and arrays of them are supported. Any link that is not followed or scraped will be ignored.

follow
: Object/array of regular expressions for links to follow ('click' on)

scrape
: Object/array of regular expressions for links to scrape data from

ignore
: (OPTIONAL) Object/array of regular expressions for links to ignore. These will not be checked for scraping or following.

parse
: Directory of parsers describing what to scrape off of each website. Example in the next section

output
: Where to put the data when the crawl is completed
```js
const Scraper = require('web-crawl')

let exampleScraper = new Scraper({
  name: 'Example Crawler',
  type: 'WebCrawler',
  params: {
    uri: 'https://www.example.com',
    headers: {
      'User-Agent': 'Some Way To Identify Me'
    }
  },
  delay: 3,
  settings: {
    follow: new RegExp('https://www\\.example\\.com/data'),
    scrape: new RegExp('/data/specific/'),
    ignore: [new RegExp('comments'), new RegExp('about-us')]
  },
  parse: require('./ScraperModules'),
  output: require('scraper-writer')
})

exampleScraper.start()
```
## Parser Setup
Parsers are very simple modules that contain an xPath string and a process function.

xPath
: xPath expression for selecting what to scrape

process
: Function describing how to parse the scraped content. The result is a wrapped response with both extract() and extract_first() functions: extract() returns all matching results in an array, and extract_first() returns the first item of that array.
In your parser directory, you currently need an index.js file like the one below that exports your parsers:
```js
module.exports = {
  name: require('./name.js'),
  description: require('./description.js')
}
```
Example parser file
```js
module.exports = {
  xPath: '//h1[@id=\'huge-feature-box-title\']/text()',
  process: result => {
    return result.extract_first()
  }
}
```
New: You can also use an array of xPath strings as the xPath value if you want more than one item parsed for a given file.
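For instance, a parser pulling both a title and a subtitle might look like the sketch below. The second element ID is a hypothetical placeholder, and it assumes the matches from all expressions arrive in one wrapped result; the docs don't spell out the exact shape:

```js
module.exports = {
  // Two xPath expressions: one for the title, one for an assumed subtitle element
  xPath: [
    '//h1[@id=\'huge-feature-box-title\']/text()',
    '//h2[@id=\'huge-feature-box-subtitle\']/text()'
  ],
  process: result => {
    // extract() returns every match as an array, rather than just the first
    return result.extract()
  }
}
```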
## Output Setup
The output is just a simple module that has a write function. I have a basic file writer (shown below) that can be used; users are welcome to create their own as well.
```js
let fs = require('fs')

module.exports = {
  write: data => {
    // Pretty-print the crawl results to results.json
    fs.writeFile('results.json', JSON.stringify(data, null, 1), err => {
      if (err)
        console.error(err)
    })
  }
}
```
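Since the output is any object exposing a write(data) function, custom writers are straightforward. As a hypothetical variant (not part of the package), this one writes each crawl's results to a timestamped file so repeated runs don't overwrite each other:

```js
let fs = require('fs')

module.exports = {
  write: data => {
    // One file per run, e.g. results-1700000000000.json (hypothetical naming)
    const file = `results-${Date.now()}.json`
    fs.writeFile(file, JSON.stringify(data, null, 2), err => {
      if (err)
        console.error(err)
    })
  }
}
```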