
javadocs-scraper v1.2.0 · 244 downloads

📚 javadocs-scraper

A TypeScript library to scrape Java objects information from a Javadocs website.


Specifically, it scrapes data (name, description, URL, etc.) about packages, classes, interfaces, enums, annotations, methods, and fields, and links them together.

Some extra data is also calculated after scraping, like method and field inheritance.

[!CAUTION] Tested with Javadocs generated from Java 7 to Java 21. I cannot guarantee this will work with older or newer versions.


📦 Installation and Usage

  1. Install with your preferred package manager:

```sh
npm install javadocs-scraper
# or
yarn add javadocs-scraper
# or
pnpm add javadocs-scraper
```

  2. Instantiate a Scraper:

```ts
import { Scraper } from 'javadocs-scraper';

const scraper = Scraper.fromURL('https://...');
```

[!NOTE] This package uses constructor dependency injection for every class.

You can also instantiate Scraper with the new keyword, but you'll need to specify every dependency manually.

The easier way is to use the Scraper.fromURL() method, which uses the default implementations.

[!TIP] Alternatively, you can provide your own Fetcher to fetch data from the Javadocs:

```ts
import type { Fetcher } from 'javadocs-scraper';

class MyFetcher implements Fetcher {
  /** ... */
}

const myFetcher = new MyFetcher('https://...');
const scraper = Scraper.with({ fetcher: myFetcher });
```
  3. Use the Scraper to scrape information:

```ts
const javadocs: Javadocs = await scraper.scrape();

/** for example */
const myInterface = javadocs.getInterface('org.example.Interface');
```
[!TIP] The Javadocs object uses discord.js' Collection class to store all the scraped data. This is an extension of Map with utility methods, like find(), reduce(), etc.

These collections are also typed as mutable, so any modification will be reflected in the backing Javadocs. This is by design, since the library no longer uses this object once it's given to you, and doesn't care what you then do with it.

Check the discord.js guide or the Collection docs for more info.
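Since these collections are Maps with extra lookup helpers, here is a minimal sketch of the kind of queries they enable. `MiniCollection` below is a hypothetical stand-in for discord.js' Collection (which has many more helpers), and the interface data is made up for illustration:

```ts
// Illustrative stand-in for discord.js' Collection: a Map subclass
// with a find() helper, so values can be located by predicate
// rather than only by key.
class MiniCollection<K, V> extends Map<K, V> {
  find(predicate: (value: V) => boolean): V | undefined {
    for (const value of this.values()) {
      if (predicate(value)) return value;
    }
    return undefined;
  }
}

interface InterfaceData { name: string; qualifiedName: string; }

const interfaces = new MiniCollection<string, InterfaceData>([
  ['org.example.Foo', { name: 'Foo', qualifiedName: 'org.example.Foo' }],
  ['org.example.Bar', { name: 'Bar', qualifiedName: 'org.example.Bar' }],
]);

// Look up by simple name rather than by fully qualified key:
const bar = interfaces.find((i) => i.name === 'Bar');
```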

🔒 Warnings

  • Make sure not to spam a Javadocs website. It's your responsibility not to abuse the library, and to implement appropriate safeguards against abuse, like a cache.
  • The scrape() method will take a while, since it scrapes the entire website. Make sure to only run it when necessary, ideally only once in the entire program's lifecycle.

🔍 Specifics

There are distinct types of objects that hold the library together:

  • A Fetcher¹, which makes requests to the Javadocs website.
  • Entities², which represent a scraped object.
  • QueryStrategies¹, which query the website through cheerio. These are needed because HTML classes and IDs change between Javadoc versions.
  • Scrapers¹, which scrape information from a given URL or cheerio object, to a partial object.
  • Partials², which represent a partially scraped object, that is, an object without circular references to other objects.
  • A ScraperCache, which caches partial objects in memory.
  • Patchers¹, which patch partials to make them full entities, by linking them together.
  • Javadocs, which is the final result of the scraping process.

¹ - Replaceable via constructor injection.

² - Only a type, not available in runtime.
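The partial/patcher split can be illustrated with a generic sketch. The names below are illustrative, not the library's actual types: partials reference each other by id (avoiding circular references), and a patch step resolves those ids into direct object links using the cache:

```ts
// Illustrative only: partials store ids instead of object references,
// and a "patcher" resolves those ids into direct links afterwards.
interface PartialMethod { id: string; name: string; }
interface PartialClass { name: string; methodIds: string[]; }

interface PatchedMethod { name: string; owner: PatchedClass; }
interface PatchedClass { name: string; methods: PatchedMethod[]; }

function patch(
  cls: PartialClass,
  methodCache: Map<string, PartialMethod>,
): PatchedClass {
  const patched: PatchedClass = { name: cls.name, methods: [] };
  for (const id of cls.methodIds) {
    const m = methodCache.get(id);
    // The patched method links back to its owner, a circular
    // reference the partial form deliberately avoided.
    if (m) patched.methods.push({ name: m.name, owner: patched });
  }
  return patched;
}

const methodCache = new Map<string, PartialMethod>([
  ['m1', { id: 'm1', name: 'toString' }],
]);
const patchedFoo = patch({ name: 'Foo', methodIds: ['m1'] }, methodCache);
```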

The scraping process occurs in the following steps:

  1. A QueryStrategy is chosen by the QueryStrategyFactory.
  2. The RootScraper iterates through every package in the Javadocs root.
  3. For every package, it's fetched, and passed to the PackageScraper.
  4. The PackageScraper iterates through every class, interface, enum and annotation in the package and passes them to the appropriate Scraper.
  5. Each scraper creates a partial object, and caches it in the ScraperCache.
  6. Once everything is done, the Scraper uses the Patchers to patch the partial objects together, by passing the cache to each patcher.
  7. The Scraper returns the patched objects, in a Javadocs object.

[!TIP] You can provide your own QueryStrategyFactory to change the way the QueryStrategy is chosen.

```ts
import { OnlineFetcher } from 'javadocs-scraper';

const myFetcher = new OnlineFetcher('https://...');
// MyQueryStrategyFactory is your own QueryStrategyFactory implementation.
const factory = new MyQueryStrategyFactory();
const scraper = Scraper.with({
  fetcher: myFetcher,
  queryStrategyFactory: factory
});
```