broken-links-inspector
v1.4.0
Published
Extract and recursively check all URLs reporting broken ones
Downloads
34
Maintainers
Readme
Broken Links Inspector
This project is heavily inspired by stevenvachon/broken-link-checker.
If you want to use this tool and need any help (instructions, bug fixes, features) open an issue!
Features:
- inspects a web-page and all its URLs, reports broken ones
- can go recursively, inspecting all pages within a domain
- makes requests in parallel, shows indication of "work in progress"
- does not check URL twice
- reports OK, TIMEOUT, ERROR CODE or generic error
- support configurable timeout
- supports GET and HEAD methods (double checks with GET if HEAD fails)
- supports a list of excluded URLs (glob matching) and/or excluded prefixes (e.g.
mailto:
) - can define OK codes, such as 999 for linkedin
- supports different reporting, such as colored console or JUnit file
- JUnit report is best used with CI (tested with GitLab)
- need a feature, go to issues
How to install and run
npm i -g broken-links-inspector
bli inspect https://dbogatov.org -r -t 2000 -s linkedin --reporters console
# or
# bli inspect file://links.txt
# with a URL per line in a file links.txt
................................................................................
................................................................................
........................
original request
OK : https://dbogatov.org/
OK: 1, skipped: 0, broken: 0
https://dbogatov.org/
OK : https://scholar.google.com/citations?user=Mq8ButkAAAAJ
OK : https://d3g9eenuvjhozt.cloudfront.net/assets/docs/resume.pdf
OK : https://d3g9eenuvjhozt.cloudfront.net/assets/docs/cv.pdf
OK : https://twitter.com/Dima4ka007
OK : https://d3g9eenuvjhozt.cloudfront.net/assets/vendor/css/merged.css
OK : https://d3g9eenuvjhozt.cloudfront.net/assets/vendor/js/merged.js
OK : https://d3g9eenuvjhozt.cloudfront.net/assets/img/dmytro-bogatov.jpg
OK : https://dbogatov.org/contact
OK : https://dbogatov.org/research
OK : https://d3g9eenuvjhozt.cloudfront.net/assets/favicon.ico
OK : https://dbogatov.org/publications
OK : https://www.googletagmanager.com/gtag/js?id=UA-65293382-4
OK : https://stackpath.bootstrapcdn.com/font-awesome/4.7.0/css/font-awesome.min.css
OK : https://git.dbogatov.org/dbogatov/research-website/commit/39ecd1a9
OK : https://dbogatov.org/projects
OK : https://www.facebook.com/dkbogatov
OK : https://dbogatov.org/education
OK : https://github.com/dbogatov
OK: 18, skipped: 3, broken: 0
https://dbogatov.org/education
OK : https://d3g9eenuvjhozt.cloudfront.net/assets/config/grades.yml
OK: 1, skipped: 21, broken: 0
https://dbogatov.org/projects
OK : https://d3g9eenuvjhozt.cloudfront.net/assets/img/projects/mandelbrot.png
OK : https://d3g9eenuvjhozt.cloudfront.net/assets/img/projects/matters-proj.png
OK : https://d3g9eenuvjhozt.cloudfront.net/assets/img/projects/shevastream.png
OK : https://github.com/WPIMHTC
OK : https://d3g9eenuvjhozt.cloudfront.net/assets/img/projects/status-site.png
OK : https://d3g9eenuvjhozt.cloudfront.net/assets/img/projects/bu-logo.png
OK : https://d3g9eenuvjhozt.cloudfront.net/assets/img/projects/fabric.png
OK : https://github.com/dbogatov/shevastream
OK : https://legacy.dbogatov.org/Project/Mandelbrot
OK : https://github.com/dbogatov/legacy-website
OK : https://github.com/IBM/dac-lib
OK : https://github.com/dbogatov/status-site
OK : https://github.com/dbogatov/ore-benchmark
OK : https://shevastream.com/
OK : https://status.dbogatov.org/
OK : https://ore.dbogatov.org/
OK : http://matters.mhtc.org/
OK : https://dbogatov.org/assets/docs/dac-fabric.pdf
OK: 18, skipped: 21, broken: 0
https://dbogatov.org/publications
OK : https://d3g9eenuvjhozt.cloudfront.net/assets/docs/mqp-paper.pdf
OK : https://d3g9eenuvjhozt.cloudfront.net/assets/docs/econ-paper.pdf
OK : https://d3g9eenuvjhozt.cloudfront.net/assets/docs/ore-presentation.pdf
OK : https://d3g9eenuvjhozt.cloudfront.net/assets/docs/ore-poster.pdf
OK : https://d3g9eenuvjhozt.cloudfront.net/assets/docs/ore-benchmark.pdf
OK : http://dispot.korkinlab.org/
OK : https://d3g9eenuvjhozt.cloudfront.net/assets/docs/dac-fabric.pdf
OK : https://d3g9eenuvjhozt.cloudfront.net/assets/docs/dispot.pdf
OK : https://hub.docker.com/r/korkinlab/dispot
OK : https://github.com/korkinlab/dispot
OK : https://digitalcommons.wpi.edu/cgi/viewcontent.cgi?article=2915&context=iqp-all
OK : https://dl.acm.org/doi/10.14778/3324301.3324309
OK : https://doi.org/10.14778/3324301.3324309
OK : https://doi.org/10.1093/bioinformatics/btz587
OK : https://academic.oup.com/bioinformatics/article/35/24/5374/5539863
OK: 15, skipped: 21, broken: 0
https://dbogatov.org/research
OK : http://people.cs.georgetown.edu/~kobbi/
OK : https://arxiv.org/abs/1706.01552
OK : https://www.cs.bu.edu/~reyzin/
OK : http://www.cs.bu.edu/~gkollios/
OK : https://d3g9eenuvjhozt.cloudfront.net/assets/img/collaborators/bjoern.png
OK : https://d3g9eenuvjhozt.cloudfront.net/assets/img/collaborators/kobi.jpg
OK : https://d3g9eenuvjhozt.cloudfront.net/assets/img/collaborators/kellaris.jpeg
OK : https://d3g9eenuvjhozt.cloudfront.net/assets/img/collaborators/lorenzo.png
OK : https://d3g9eenuvjhozt.cloudfront.net/assets/img/collaborators/leo.png
OK : https://d3g9eenuvjhozt.cloudfront.net/assets/img/collaborators/adam.jpg
OK : http://www.cs.bu.edu/fac/gkollios/
OK : https://d3g9eenuvjhozt.cloudfront.net/assets/img/collaborators/kollios.png
OK : https://d3g9eenuvjhozt.cloudfront.net/assets/img/collaborators/pixel.jpg
OK : https://www.icloud.com/sharedalbum/
OK : https://www.cics.umass.edu/people/oneill-adam
OK : https://computerscience.uchicago.edu/people/profile/lorenzo-orecchia/
OK : https://midas.bu.edu/
OK : https://dblp.org/pers/t/Tackmann:Bj=ouml=rn.html
OK : https://dbogatov.org/assets/docs/ore-benchmark.pdf
OK : https://dbogatov.org/assets/docs/dac-fabric.pdf
OK: 20, skipped: 22, broken: 0
https://dbogatov.org/contact
OK: 0, skipped: 23, broken: 0
OK: 73, skipped: 111, broken: 0
How to use
$ bli inspect -h
Usage: index inspect [options] <url> <file://>
Check links in the given URL or a text file
Options:
-r, --recursive recursively check all links in all URLs within supplied host (ignored for file://) (default: false)
-t, --timeout <number> timeout in ms after which the link will be considered broken (default: 2000)
-g, --get use GET request instead of HEAD (default: false)
-s, --skip <globs> URLs to skip defined by globs, like '*linkedin*' (default: [])
--reporters <coma-separated-strings> Reporters to use in processing the results (junit, console) (default: ["console"])
--retries <number> The number of times to retry TIMEOUT URLs (default: 3)
--user-agent <string> The User-Agent header (default: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15
(KHTML, like Gecko) Version/14.1 Safari/605.1.15")
--ignore-prefixes <coma-separated-strings> prefix(es) to ignore (without ':'), like mailto: and tel: (default: ["javascript","data","mailto","sms","tel","geo"])
--accept-codes <coma-separated-numbers> HTTP response code(s) (beyond 200-299) to accept, like 999 for linkedin (default: [999])
--ignore-skipped Do not report skipped URLs (default: false)
--single-threaded Do not enable parallelization (default: false)
-v, --verbose log progress of checking URLs (default: false)
-h, --help display help for command
Return code is 1 if at least one broken link detected, 0 otherwise.
-r, --recursive
will instruct inspector to keep checking all URLs in the original domain.
Very useful for checking an entire website, such as personal blog.
For example, bli inspect https://yoursite.com -r
will check yoursite.com
and if it finds something like yoursite.com/contact
it will check that as well and will keep going.
It will check all URLs on all pages, but will not parse "external" pages.
-t, --timeout <number>
given in milliseconds sets a timeout for a request.
If this timeout is exceeded, the check fails with TIMEOUT.
-g, --get
instructs to use GET request instead fo the default HEAD request.
If HEAD request fails, the URL will be retried with GET.
-s, --skip <coma-separated-globs>
is a list of globs or parts of URL to skip.
As an example, -s *linkedin* -s hello
will instruct to skip all URLs which contain either linkedin
or hello
in them.
--reporters <coma-separated-strings>
is a list of reporters to process the result.
Currently there are two: console
and junit
.
console
will print appealing colored report to the console.
junit
will produce junit-report.xml
file in the current directory.
JUnit file treats pages as test suites and URLs in a page as test cases.
--retries
will instruct the number of times to try a URL before declaring it failed.
--user-agent <string>
will use specified User-Agent
header (some websites reply with 401 Unauthorized for "bots")
--ignore-prefixes <coma-separated-strings>
is a list of prefixes/ schemas to skip, such as mailto:
.
Provided list should not include colons.
--accept-codes <coma-separated-numbers>
is a list of HTTP code to consider successful, like 999 for linkedin.
--ignore-skipped
excludes skipped URLs from reports.
--single-threaded
mandates a sequential execution (should be used in for debugging).
-v, --verbose
currently unused.
How to build
npm install # to install dependencies
npm run build # to compile TS (result in ./dist/index.js)
npm run coverage # to run tests and coverage