site2pdf-cli v0.1.5

Generate comprehensive PDFs of entire websites, ideal for RAG.

Downloads: 358

site2pdf

This tool generates a PDF file containing the main page and all sub-pages of a website that match a provided URL pattern.

📗The PDF generated by this tool is particularly well-suited for AI-based Retrieval-Augmented Generation (RAG) and Question Answering (QA) tasks.📗

Motivation

  • 🧳 Portability: Combining multiple pages of a website into a single file makes the information easier to share and use.
  • 🤖 AI Integration: In some use cases, such as Google NotebookLM and ChatGPT GPTs, providing a master dataset in PDF format helps in creating more efficient bots.
  • 🖼️ Visual Information Preservation: Generating results in PDF format preserves visual information such as images, so multimodal models can recognize it better.

Prerequisites


To run this software, you need to have Node.js installed on your machine. You can download and install the latest version of Node.js from the official Node.js website.

Dependencies (Linux)

On Linux, install the system libraries that the Chrome binary launched by Puppeteer depends on:

sudo apt-get update
sudo apt-get install -y libxkbcommon0
sudo apt-get install -y libnss3 libxss1 libasound2
sudo apt-get install -y fonts-liberation libappindicator3-1 libatk-bridge2.0-0 libatspi2.0-0 libgtk-3-0 libgbm-dev

Usage

npx site2pdf-cli <main_url> [url_pattern]

Arguments

  • <main_url>: The main URL of the website to be converted to PDF.
  • [url_pattern]: Optional regular expression to filter sub-links. Defaults to matching only links within the main URL domain.
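
The filtering described above can be sketched roughly as follows. This is an illustration only, not the tool's actual source: the helper name `filterSubLinks` is hypothetical, and treating "within the main URL domain" as "same hostname as main_url" is an assumption.

```typescript
// Hypothetical helper, not part of site2pdf-cli's public API.
// Keeps only the links that match url_pattern. When no pattern is given,
// it falls back to links on the same hostname as the main URL (assumed default).
function filterSubLinks(
  mainUrl: string,
  links: string[],
  urlPattern?: string,
): string[] {
  const mainHost = new URL(mainUrl).hostname;
  const pattern = urlPattern ? new RegExp(urlPattern) : null;
  const seen = new Set<string>();
  const result: string[] = [];

  for (const link of links) {
    let keep = false;
    if (pattern) {
      keep = pattern.test(link);
    } else {
      try {
        keep = new URL(link).hostname === mainHost;
      } catch {
        keep = false; // not an absolute URL, skip it
      }
    }
    // Drop duplicates while preserving the order in which links were found.
    if (keep && !seen.has(link)) {
      seen.add(link);
      result.push(link);
    }
  }
  return result;
}
```

Because the pattern is a regular expression, an unescaped `.` matches any character. Write `\.` if you need to match a literal dot.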

Example

npx site2pdf-cli "https://www.typescriptlang.org/docs/handbook/" "https://www.typescriptlang.org/docs/handbook/2/"
> [email protected] start
> tsx index.ts https://www.typescriptlang.org/docs/handbook/ https://www.typescriptlang.org/docs/handbook/2/

Generating PDF for: https://www.typescriptlang.org/docs/handbook/
Generating PDF for: https://www.typescriptlang.org/docs/handbook/2/basic-types.html
Generating PDF for: https://www.typescriptlang.org/docs/handbook/2/everyday-types.html
Generating PDF for: https://www.typescriptlang.org/docs/handbook/2/narrowing.html
Generating PDF for: https://www.typescriptlang.org/docs/handbook/2/functions.html
Generating PDF for: https://www.typescriptlang.org/docs/handbook/2/objects.html
Generating PDF for: https://www.typescriptlang.org/docs/handbook/2/classes.html
Generating PDF for: https://www.typescriptlang.org/docs/handbook/2/modules.html
Generating PDF for: https://www.typescriptlang.org/docs/handbook/2/types-from-types.html
PDF saved to ./out/www-typescriptlang-org-docs-handbook.pdf

This command generates a PDF file at ./out/www-typescriptlang-org-docs-handbook.pdf containing the main page https://www.typescriptlang.org/docs/handbook/ and every page linked from it whose URL matches the pattern https://www.typescriptlang.org/docs/handbook/2/.
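
One way to derive an output file name like the one in the log above is sketched below. The function names `slugifyUrl` and `outputPath` are hypothetical, and the slug rule (hostname plus path, with every non-alphanumeric run replaced by a hyphen) is an assumption inferred from the example output, not taken from the tool's source.

```typescript
// Hypothetical sketch: turn the main URL into a file-system friendly slug.
function slugifyUrl(mainUrl: string): string {
  const { hostname, pathname } = new URL(mainUrl);
  return `${hostname}${pathname}`
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse dots, slashes, etc. into hyphens
    .replace(/^-+|-+$/g, ""); // trim leading and trailing hyphens
}

// Assumed output location, matching the "./out/" directory shown in the log.
function outputPath(mainUrl: string): string {
  return `./out/${slugifyUrl(mainUrl)}.pdf`;
}

console.log(outputPath("https://www.typescriptlang.org/docs/handbook/"));
// prints "./out/www-typescriptlang-org-docs-handbook.pdf"
```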

Troubleshooting for Windows

When running Puppeteer on Windows, PDF generation may fail with permission (sandbox) errors. To resolve this, grant the ALL APPLICATION PACKAGES group (SID S-1-15-2-1) read and execute access to Puppeteer's Chrome cache directory:

icacls %USERPROFILE%/.cache/puppeteer/chrome /grant *S-1-15-2-1:(OI)(CI)(RX)

For details, see "Chrome reports sandbox errors on Windows" in the Puppeteer troubleshooting guide.

Implementation Details

  • Navigates to the main page using puppeteer.
  • Finds all sub-links matching the provided url_pattern.
  • Generates a PDF for each matching page and merges them into a single document using pdf-lib.
  • Saves the final PDF file with a slugified name based on the main URL.
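
The steps above can be sketched as a small pipeline. This is a simplified illustration under stated assumptions, not the project's actual code: `PdfBackend` and `buildSitePdf` are hypothetical names, and the browser and PDF work is passed in as functions so the flow can be read (and run) without Puppeteer or pdf-lib installed. In a real run, `renderPdf` would typically wrap Puppeteer's `page.pdf()` and `mergePdfs` would typically copy pages between pdf-lib `PDFDocument` instances.

```typescript
// Hypothetical interface standing in for the Puppeteer and pdf-lib calls.
type PdfBackend = {
  collectLinks: (url: string) => Promise<string[]>; // links found on the main page
  renderPdf: (url: string) => Promise<Uint8Array>; // one PDF per page
  mergePdfs: (parts: Uint8Array[]) => Promise<Uint8Array>; // single merged PDF
};

async function buildSitePdf(
  mainUrl: string,
  urlPattern: string,
  backend: PdfBackend,
): Promise<{ urls: string[]; pdf: Uint8Array }> {
  const pattern = new RegExp(urlPattern);

  // 1. Visit the main page and collect its links.
  const links = await backend.collectLinks(mainUrl);

  // 2. Keep the main page plus every distinct sub-link matching the pattern.
  const urls = [mainUrl];
  for (const link of links) {
    if (pattern.test(link) && !urls.includes(link)) {
      urls.push(link);
    }
  }

  // 3. Render each page in order so the merged document keeps a stable layout.
  const parts: Uint8Array[] = [];
  for (const url of urls) {
    parts.push(await backend.renderPdf(url));
  }

  // 4. Merge the per-page PDFs into one document.
  return { urls, pdf: await backend.mergePdfs(parts) };
}
```

Rendering pages one at a time is slower than rendering in parallel, but it keeps memory use predictable and preserves page order. That trade-off is a choice made for this sketch, not a statement about how the tool behaves.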

Note: The provided url_pattern should be a valid regular expression. If no url_pattern is provided, the tool will default to matching only links within the main URL domain.

This tool is still under development and may have limitations. Feel free to contribute to the project by opening issues or pull requests!

Development

Prerequisites

Ensure you have Node.js and npm installed. You will also need a modern version of TypeScript and other dependencies specified in package.json.

Setup

Clone the repository and install the dependencies:

git clone https://github.com/laiso/site2pdf.git
cd site2pdf
npm install

Building

The project uses TypeScript. To compile the TypeScript files, run:

npm run build

Running the Project

You can run the project in development mode with:

npm run dev

This command uses tsx to watch for changes and rerun the project as necessary.

Testing

The project uses Jest for testing. To run the tests, execute:

npm test

Linting

Linting is configured using Biome. To check for linting issues, run:

npx biome lint

Code Formatting

To format the code according to the project's style guidelines, run:

npx biome format

Contributing

Feel free to open issues or pull requests. Make sure to follow the existing code style and include tests for new features or bug fixes.

Notes

  • The project uses ES modules. Ensure your Node.js version supports this.
  • Update dependencies as necessary, and ensure compatibility with existing code.