
hub-crawl
v2.2.2 · Published

Hub Crawl finds broken links in GitHub repos and wikis.

Downloads: 28


Hub Crawl

Hub Crawl finds broken links in GitHub repositories. It scrapes links from the readme section of each repository (or the wiki-content section of wiki pages) and continues the crawl from the newly found links. Requests are made in parallel to keep the crawl fast. In essence, it performs a concurrent, breadth-first graph traversal and logs broken links as it goes.
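At its core, this is a breadth-first traversal: a queue holds URLs to visit and a set tracks what has already been seen. The sketch below is only a simplified, sequential illustration of that idea, not Hub Crawl's actual implementation; fetchLinks and isBroken are hypothetical helpers.

// Simplified, sequential sketch of the breadth-first crawl idea.
// fetchLinks(url) and isBroken(url) are hypothetical helpers, not Hub Crawl's API.
async function crawl(entry, fetchLinks, isBroken) {
  const queue = [entry];      // URLs waiting to be visited
  const visited = new Set();  // URLs already seen
  const broken = [];          // URLs that failed

  while (queue.length > 0) {
    const url = queue.shift();
    if (visited.has(url)) continue;
    visited.add(url);

    if (await isBroken(url)) {
      broken.push(url);       // log and move on
      continue;
    }
    for (const link of await fetchLinks(url)) {
      if (!visited.has(link)) queue.push(link);
    }
  }
  return broken;
}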

Installation

Global Use

To begin using Hub Crawl, install it globally with npm.

npm install -g hub-crawl

Or, if you use yarn:

yarn global add hub-crawl

Use in Projects

To add Hub Crawl to your project, install it with npm:

npm install hub-crawl

Or, if you use yarn:

yarn add hub-crawl

Terminology

Regardless of how you choose to use Hub Crawl, the following terms are important:

entry

The entry is the URL that is first visited and scraped for links.

scope

The scope is a URL that defines the limit of link scraping. For example, let's assume the scope is set to https://github.com/louisscruz/hub-crawl. If https://github.com/louisscruz/hub-crawl/other is in the queue, it will be both visited and scraped. However, if https://google.com is in the queue, it will be visited but not scraped, because the URL does not begin with the scope URL. This keeps Hub Crawl from scouring the entire internet. If you do not provide a scope, Hub Crawl defaults to using the entry you provided.
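In other words, the decision to scrape a visited page for further links comes down to a prefix check, roughly like the following illustrative snippet:

// Illustrative only: a page is scraped for more links only if it sits inside the scope.
const shouldScrape = (url, scope) => url.startsWith(scope);

shouldScrape('https://github.com/louisscruz/hub-crawl/other',
             'https://github.com/louisscruz/hub-crawl'); // true
shouldScrape('https://google.com',
             'https://github.com/louisscruz/hub-crawl'); // false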

workers

The number of workers determines the maximum number of parallel requests to be open at any given time. The optimal number of workers depends on your hardware and internet speed.
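Conceptually, each worker repeatedly pulls a URL off a shared queue and checks it, so the worker count bounds how many requests are in flight at once. The sketch below illustrates the idea only; it is not Hub Crawl's implementation, and checkLink is a hypothetical helper.

// Conceptual sketch of bounding concurrency with a fixed pool of workers.
// checkLink(url) is a hypothetical helper that resolves once the URL has been checked.
async function runWorkers(queue, workerCount, checkLink) {
  async function worker() {
    // Each worker loops until the shared queue is drained.
    while (queue.length > 0) {
      const url = queue.shift();
      await checkLink(url);
    }
  }
  // At most workerCount requests are in flight at any given time.
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
}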

Usage

There are two ways to use Hub Crawl. For common usage, it is likely preferable to use the command line. If you are integrating Hub Crawl into a bigger project, it is probably worth importing or requiring the Hub Crawl class.

Command Line

Once Hub Crawl is installed globally, you can run hub-crawl from the command line. It accepts arguments and options in the following format:

hub-crawl [entry] [scope] -l -w 12

Arguments

[entry]

If not provided, the program will prompt you for this.

[scope]

If not provided, the program will prompt you for this.

Options

-l

If this option is provided, an initial login window will appear so that the crawl runs authenticated. This is useful for private repos.

-w

If this option is provided, it will set the maximum number of workers. For instance, -w 24 would set a maximum of 24 workers.

-V

This option shows the current version of hub-crawl.
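For example, a full invocation with login enabled and 24 workers might look like this (the repository URLs are purely illustrative):

hub-crawl https://github.com/louisscruz/hub-crawl https://github.com/louisscruz/hub-crawl -l -w 24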

Importing

If you would like to use Hub Crawl in a project, import it as follows:

import HubCrawl from 'hub-crawl';

Or, if you're still not using import:

var HubCrawl = require('hub-crawl');

Create an instance:

const crawler = new HubCrawl(12, 'https://google.com');

HubCrawl takes the following as arguments at instantiation:

HubCrawl(workers, entry[, scope]);
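For example, an instance scoped explicitly to a repository might be created like this (the worker count and URLs are illustrative):

const crawler = new HubCrawl(
  12,
  'https://github.com/louisscruz/hub-crawl',
  'https://github.com/louisscruz/hub-crawl'
);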

HubCrawl instances expose a number of methods; the most important ones are described below.

traverseAndLogOutput(login)

This method performs the traversal and logs the broken links. The login argument determines whether or not an initial window will pop up to log in.

traverse(login)

This method performs the traversal. The login argument determines whether or not an initial window will pop up to log in. The traverse method returns the broken links. Note that the workers are left alive afterwards.
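Putting the pieces together, a minimal script might look like the following. The entry URL and worker count are illustrative, and whether the traversal methods return a promise should be confirmed against the source.

import HubCrawl from 'hub-crawl';

// 12 workers, crawling this repository; scope defaults to the entry.
const crawler = new HubCrawl(12, 'https://github.com/louisscruz/hub-crawl');

// Crawl without an initial login window and log broken links as they are found.
crawler.traverseAndLogOutput(false);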

False Positives

As it currently stands, the crawler makes only a single, breadth-first, concurrent graph traversal. If a server happens to be down only temporarily during the traversal, its links will still be counted as broken. In the future, a second check will be made on each broken link to ensure that it is indeed broken.

Future Improvements

  • [x] Set the scope through user input, rather than defaulting to the entry
  • [x] Run the queries in parallel, rather than synchronously
  • [x] Make into NPM package
    • [x] Allow for CLI usage
    • [x] Also allow for fallback prompts
  • [ ] Perform a second check on all broken links to minimize false positives
  • [ ] Make the output look better
  • [ ] Allow for the crawler to be easily distributed