Link Scraper
A command-line utility that fetches all links from a given seed URL and can recursively follow them up to a given depth for whitelisted domains.
This utility provides an interactive command-line user interface as well as command-line options.
Usage
Command line options
-u, --url <string> Seed URL
-w, --whitelisted <string> Whitelisted URLs (comma-separated)
-o, --outFile <string> Output file name
-e, --extension <string> Output file extension(s), e.g. md,tsv (comma-separated)
-d, --depth <number> Depth limit to recursively scrape
-q, --query Consider Query Params for URL Uniqueness
-h, --hash Consider Hash Params for URL Uniqueness
-s, --secure Scrape only secured URLs (https: only)
--no-hash Ignore Hash Params for URL Uniqueness
--no-query Ignore Query Params for URL Uniqueness
--no-secure Allow scraping Non secure URLs (http: & https:)
--help display help for command
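For instance, a fully non-interactive run could look like the following; the seed URL, whitelisted domain, output path, and depth here are placeholder values, not ones from the package documentation:
link-scraper -u https://example.com/ -w https://example.com -o data/example-links -e md -d 1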
Using CLI Interactive Questions
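Any option omitted from the command line is, presumably, collected through an interactive question instead, so running link-scraper with no arguments should prompt for every field; Example 2 below shows a run that mixes command-line options with interactive questions.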
Example 1
Scrape Medium.com to a depth of 2, for only secure URLs (https:), and for the whitelisted domains medium.com and help.medium.com; save the output inside the data folder with the file name medium-links, in md and tsv formats. For URL uniqueness, consider the query params and ignore the hash params.
link-scraper -u https://medium.com/ -w https://medium.com,https://help.medium.com -o data/medium-links -e tsv,md -d 2 -qs --no-hash
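Here -qs is the combined short form of -q (consider query params) and -s (secure URLs only), and --no-hash excludes hash params from the uniqueness check.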
Example 2
Initialize the application partially from the command line and fill in the remaining fields through the interactive interface.
Via the command-line options: set the depth to 2, set the extension to tsv, consider only https: URLs, and for the uniqueness test consider the query params and ignore the hash params.
The seed URL, whitelisted domains, and file path will be entered through the interactive questions.
link-scraper -qs --no-hash -d 2 -e tsv