norch-crawlers
v0.0.3
A Node.js crawler library for quickly and easily building versatile crawlers.
norch-crawlers exists to make working with request and cheerio a little easier, so that the standard boilerplate does not have to be written over and over again.
Functions
- Play nice with servers: Wait between each request.
- Get the `next` and `last` URLs for pagination scenarios.
- Write the list synchronously to a file at the end.
- Serving header info
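
The waiting and file-writing behaviour listed above can be sketched roughly as follows. This is not the norch-crawlers API, which this README does not document; `wait`, `crawlPolitely`, `writeListSync`, `fetchPage` and `delayMs` are hypothetical names chosen for illustration. The fetch function is passed in so the sketch runs without network access. In a real crawler it would wrap a call to request.

```javascript
const fs = require('fs');

// Resolve after `ms` milliseconds; used to pause between requests.
function wait(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Fetch each URL in turn, pausing `delayMs` between consecutive requests.
// `fetchPage` is a hypothetical injected function: (url) => Promise<string>.
async function crawlPolitely(urls, fetchPage, delayMs) {
  const results = [];
  for (let i = 0; i < urls.length; i++) {
    if (i > 0) await wait(delayMs); // no pause before the first request
    results.push(await fetchPage(urls[i]));
  }
  return results;
}

// Write the collected list to disk in one synchronous call, one item per line.
function writeListSync(file, items) {
  fs.writeFileSync(file, items.join('\n') + '\n', 'utf8');
}
```

A caller would collect results with `await crawlPolitely(urls, fetchPage, 1000)` and then call `writeListSync('list.txt', results)` once crawling has finished. A fixed delay is the simplest way to play nice with a server; the delay value is a judgment call per site.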
Examples
- List crawling: Crawl paginated lists for URLs
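
As a rough illustration of list crawling, the sketch below follows `next` links until it reaches the `last` page or runs out of links. It assumes the pages mark pagination with `rel="next"` and `rel="last"` anchors, which not every site does. It also uses a simple regular expression instead of cheerio so that it runs without dependencies; with cheerio a selector such as `a[rel="next"]` would be more robust. `findRelLink`, `getPagination`, `crawlList` and `fetchPage` are hypothetical names, not functions exported by this library.

```javascript
// Return the href of the first <a> tag whose rel attribute contains `rel`, or null.
// Regex-based and therefore fragile; a stand-in for a proper HTML parser.
function findRelLink(html, rel) {
  const anchors = html.match(/<a\b[^>]*>/gi) || [];
  for (const tag of anchors) {
    const relMatch = /\brel\s*=\s*["']([^"']*)["']/i.exec(tag);
    const hrefMatch = /\bhref\s*=\s*["']([^"']*)["']/i.exec(tag);
    if (relMatch && hrefMatch && relMatch[1].split(/\s+/).includes(rel)) {
      return hrefMatch[1];
    }
  }
  return null;
}

// Resolve the `next` and `last` pagination URLs against the current page URL.
function getPagination(html, pageUrl) {
  const toAbsolute = (href) => (href ? new URL(href, pageUrl).href : null);
  return {
    next: toAbsolute(findRelLink(html, 'next')),
    last: toAbsolute(findRelLink(html, 'last')),
  };
}

// Walk a paginated list and return the page URLs visited, in order.
// `fetchPage` is a hypothetical injected function: (url) => Promise<string>.
async function crawlList(startUrl, fetchPage, maxPages = 100) {
  const visited = [];
  let url = startUrl;
  // Stop on a missing link, on the page cap, or on a loop back to a seen page.
  while (url && visited.length < maxPages && !visited.includes(url)) {
    const html = await fetchPage(url);
    visited.push(url);
    const { next, last } = getPagination(html, url);
    url = url === last ? null : next;
  }
  return visited;
}
```

The `maxPages` cap and the visited check are there so that a site with circular pagination links cannot keep the crawler running forever.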
Planned functionality
- [ ] Item crawling
- [ ] Pagination iteration, second version
- [ ] Define which domain(s) to crawl
- [ ] Site-crawl - Add found URLs to crawl queue
- [ ] Write content asynchronously (append to file) throughout crawling.
- [ ] Follow robots.txt
- [ ] Check if new content
- [ ] Check if updated content
- [ ] Override the crawler's headers and set the `from` field.
- [ ] Crawl with headless browser.