W.R.E.C.K. Web Crawler
WRECK (Web Recursive and Efficiently Crawl and checK) is a fast, reliable, flexible web crawling kit.
Project Status
This project is in the design and prototype phase.
Roadmap
- [x] Initial design and request for feedback
- [x] First prototype for testing
- [x] Multi process crawling
- [x] Configurable per-process concurrency
- [x] HTTP and HTTPS support
- [x] HTTP retries
- [x] HEAD and GET requests
- [x] Shared work queue
- [x] Request rate limiting
- [x] Crawl depth
- [x] Limit to original domain
- [x] URL normalization
- [x] Exclude patterns
- [x] Persistent state across runs
- [x] Maximum request limit
- [x] Output levels
- [x] Simple reporting
- [x] Basic unit testing
- [ ] Nofollow patterns
- [ ] Include patterns
- [ ] Domain whitelist
- [ ] Reporting
- [ ] Unit testing
- [ ] Functional testing
- [ ] Incorporate design feedback
- [ ] Code clean-up
- [ ] Performance and memory profiling and improvements
- [ ] Implement all core features
- [ ] Add to npm registry
Installing
npm i -g @lucascaro/wreck
Invoke it by running wreck.
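If you prefer not to install the package globally, npx should also be able to run it directly; this is a general npm feature rather than something the project documents:
npx @lucascaro/wreck crawl -u https://example.com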
Running
Show available commands
$ wreck
wreck v0.0.1
Usage:
wreck [options] [commands] Reliable and Efficient Web Crawler
Options:
-v --verbose Make operation more talkative.
-s --silent Make operation silent (Only errors and warnings will be shown).
-f --state-file <fileName> Path to status file.
Available Subcommands:
crawl
report
run wreck help <subcommand> for more help.
Crawl
$ wreck help crawl
crawl
Usage:
crawl [options]
Options:
-u --url <URL> Crawl starting from this URL
-R --retries <number> Maximum retries for a URL
-t --timeout <number> Maximum seconds to wait for requests
-m --max-requests <number> Maximum requests for this run.
-n --no-resume Force the command to restart crawling from scratch, even if there is saved state.
-w --workers <nWorkers> Start this many workers. Defaults to one per CPU.
-d --max-depth <number> Maximum link depth to crawl.
-r --rate-limit <number> Number of requests that will be made per second.
-e --exclude <regex> Do not crawl URLs that match this regex. Can be specified multiple times.
-c --concurrency <concurrency> How many requests can be active at the same time.
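As a sketch of how these options combine, the following bounded crawl limits depth and total requests and skips two exclude patterns (the URL and patterns are placeholders, not defaults):
wreck crawl -u https://example.com -d 3 -m 1000 -e "\.pdf$" -e "/archive/"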
Crawl an entire website
Default operation:
wreck crawl -u https://example.com
This will use the default operation mode:
- 1 worker process per CPU
- 100 maximum concurrent requests
- save state to ./wreck.run.state.json (see the state-file example after this list)
- automatically resume work if state file is present
- unlimited crawl depth
- limit crawling to the provided main domain
- no rate limit
- 3 maximum retries for URLs that return a 429 status code
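Because runs resume automatically from the state file, you can keep crawls of different sites separate by pointing the global -f option at a per-site file (the file name here is only an example):
wreck -f ./example.com.state.json crawl -u https://example.com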
Minimal operation (useful for debugging):
wreck crawl -u https://example.com --concurrency=1 --workers=1 --rate-limit=1
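If an earlier run left a state file behind and you want to start from scratch instead of resuming, the --no-resume flag documented above forces a fresh crawl:
wreck crawl -u https://example.com --no-resume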
Debug
Clone the repository:
git clone [email protected]:lucascaro/wreck.git
cd wreck
npm link
This project uses the debug package. Set the environment variable DEBUG to * to see all output:
DEBUG=* wreck crawl -u https://example.com
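The debug package also accepts comma-separated namespace patterns if the full output is too noisy; the wreck:* namespace below is an assumption about how this project names its loggers, not a documented value:
DEBUG=wreck:* wreck crawl -u https://example.com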
Contributing
Please feel free to add questions, comments, and suggestions via GitHub issues.