site2pdf-cli
v0.1.5
Generate comprehensive PDFs of entire websites, ideal for RAG.
site2pdf
This tool generates a PDF file containing the main page and all sub-pages of a website that match a provided URL pattern.
📗The PDF generated by this tool is particularly well-suited for AI-based Retrieval-Augmented Generation (RAG) and Question Answering (QA) tasks.📗
Motivation
🧳Portability: Combining multiple pages of a website into a single file enhances portability, making it easier to share and use the information.
🤖AI Integration: In some use cases, such as with Google NotebookLM and ChatGPT GPTs, providing a master dataset in PDF format helps in creating more efficient bots.
🖼️Visual Information Preservation: By generating results in PDF format, visual information like images is preserved, ensuring better recognition by multimodal models.
Prerequisites
To run this software, you need to have Node.js installed on your machine. You can download and install the latest version of Node.js from the official Node.js website.
Dependencies (Linux)
On Linux, install the following system packages before running the tool:
sudo apt-get update
sudo apt-get install -y libxkbcommon0
sudo apt-get install -y libnss3 libxss1 libasound2
sudo apt-get install -y fonts-liberation libappindicator3-1 libatk-bridge2.0-0 libatspi2.0-0 libgtk-3-0 libgbm-dev
Usage
npx site2pdf-cli <main_url> [url_pattern]
Arguments
<main_url>: The main URL of the website to be converted to PDF.
[url_pattern]: Optional regular expression to filter sub-links. Defaults to matching only links within the main URL domain.
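The argument handling described above can be sketched as follows. This is a minimal illustration, not the tool's actual source: `parseCliArgs`, `escapeRegExp`, and `CliOptions` are hypothetical names, and the exact form of the default pattern (a prefix match on the main URL's origin) is an assumption based on the description "links within the main URL domain".

```typescript
// Hypothetical helper names; not taken from the site2pdf source.
interface CliOptions {
  mainUrl: string;
  urlPattern: RegExp;
}

// Escape regex metacharacters so a URL can be embedded in a pattern literally.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// args: the CLI arguments after the command name, e.g. process.argv.slice(2).
function parseCliArgs(args: string[]): CliOptions {
  const [mainUrl, rawPattern] = args;
  if (!mainUrl) {
    throw new Error("Usage: site2pdf-cli <main_url> [url_pattern]");
  }
  // new URL() throws on malformed input, which also validates <main_url>.
  const origin = new URL(mainUrl).origin;
  // Assumed default: any link on the same origin as the main URL.
  const urlPattern = rawPattern
    ? new RegExp(rawPattern)
    : new RegExp(`^${escapeRegExp(origin)}(?:[/?#]|$)`);
  return { mainUrl, urlPattern };
}
```

Because [url_pattern] is compiled as a regular expression, characters such as `.` and `?` keep their regex meaning unless they are escaped on the command line.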
Example
npx site2pdf-cli "https://www.typescriptlang.org/docs/handbook/" "https://www.typescriptlang.org/docs/handbook/2/"
> [email protected] start
> tsx index.ts https://www.typescriptlang.org/docs/handbook/ https://www.typescriptlang.org/docs/handbook/2/
Generating PDF for: https://www.typescriptlang.org/docs/handbook/
Generating PDF for: https://www.typescriptlang.org/docs/handbook/2/basic-types.html
Generating PDF for: https://www.typescriptlang.org/docs/handbook/2/everyday-types.html
Generating PDF for: https://www.typescriptlang.org/docs/handbook/2/narrowing.html
Generating PDF for: https://www.typescriptlang.org/docs/handbook/2/functions.html
Generating PDF for: https://www.typescriptlang.org/docs/handbook/2/objects.html
Generating PDF for: https://www.typescriptlang.org/docs/handbook/2/classes.html
Generating PDF for: https://www.typescriptlang.org/docs/handbook/2/modules.html
Generating PDF for: https://www.typescriptlang.org/docs/handbook/2/types-from-types.html
PDF saved to ./out/www-typescriptlang-org-docs-handbook.pdf
This command generates a PDF file named www-typescriptlang-org-docs-handbook.pdf in the ./out directory. The file contains the main page, https://www.typescriptlang.org/docs/handbook/, and every page linked from it that matches the pattern https://www.typescriptlang.org/docs/handbook/2/.
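The output name in the log above appears to be derived from the main URL by replacing dots and slashes with hyphens. A minimal sketch of such a slug function is shown here; `outputPathFor` is a hypothetical name and the exact slug rules are an assumption inferred from the one example in the log, not the tool's confirmed behavior.

```typescript
// Hypothetical helper: derive an output path from the main URL.
function outputPathFor(mainUrl: string, outDir = "./out"): string {
  const { hostname, pathname } = new URL(mainUrl);
  const slug = `${hostname}${pathname}`
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse dots, slashes, etc. into single hyphens
    .replace(/^-+|-+$/g, ""); // trim leading and trailing hyphens
  return `${outDir}/${slug}.pdf`;
}

console.log(outputPathFor("https://www.typescriptlang.org/docs/handbook/"));
// prints "./out/www-typescriptlang-org-docs-handbook.pdf"
```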
Troubleshooting for Windows
When running Puppeteer on Windows, you may encounter permission issues when generating PDFs. To resolve this, grant the required permissions on Puppeteer's Chrome cache directory by running the following command:
icacls %USERPROFILE%/.cache/puppeteer/chrome /grant *S-1-15-2-1:(OI)(CI)(RX)
Reference: Troubleshooting - Chrome reports sandbox errors on Windows | Puppeteer
Implementation Details
- Navigates to the main page using puppeteer.
- Finds all sub-links matching the provided url_pattern.
- Generates a PDF for each sub-link and merges them into a single document using pdf-lib.
- Saves the final PDF file with a slugified name based on the main URL.
Note: The provided url_pattern should be a valid regular expression. If no url_pattern is provided, the tool will default to matching only links within the main URL domain.
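The link-selection step can be illustrated with a small, browser-free sketch. It assumes the link targets have already been collected from the main page as strings; `selectSubLinks` is a hypothetical name, and the choices to resolve relative links, drop `#fragment` suffixes, and de-duplicate are assumptions rather than documented behavior of the tool.

```typescript
// Hypothetical helper: decide which collected links should be rendered to PDF.
function selectSubLinks(
  hrefs: string[],
  mainUrl: string,
  urlPattern: RegExp,
): string[] {
  const mainHref = new URL(mainUrl).href;
  const selected = new Set<string>();
  for (const href of hrefs) {
    let absolute: URL;
    try {
      absolute = new URL(href, mainUrl); // resolve relative links against the main page
    } catch {
      continue; // skip hrefs that cannot be parsed as URLs
    }
    absolute.hash = ""; // treat #fragment variants as the same page
    const url = absolute.href;
    // The main page is rendered separately, so it is excluded here.
    if (url !== mainHref && urlPattern.test(url)) {
      selected.add(url);
    }
  }
  return Array.from(selected); // a Set preserves first-seen order
}
```

Each selected URL would then be rendered to a PDF and the results merged in order. A pattern created with the `g` flag makes `RegExp.prototype.test` stateful, so this sketch assumes the pattern is compiled without flags.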
This tool is still under development and may have limitations. Feel free to contribute to the project by opening issues or pull requests!
Development
Prerequisites
Ensure you have Node.js and npm installed. You will also need a modern version of TypeScript and the other dependencies specified in package.json.
Setup
Clone the repository and install the dependencies:
git clone https://github.com/laiso/site2pdf.git
cd site2pdf
npm install
Building
The project uses TypeScript. To compile the TypeScript files, run:
npm run build
Running the Project
You can run the project in development mode with:
npm run dev
This command uses tsx to watch for changes and recompile as necessary.
Testing
The project uses Jest for testing. To run the tests, execute:
npm test
Linting
Linting is configured using Biome. To check for linting issues, run:
npx biome lint
Code Formatting
To format the code according to the project's style guidelines, run:
npx biome format
Contributing
Feel free to open issues or pull requests. Make sure to follow the existing code style and include tests for new features or bug fixes.
Notes
- The project uses ES modules. Ensure your Node.js version supports this.
- Update dependencies as necessary, and ensure compatibility with existing code.