page-piranha
v0.1.3
Published
LLM based page parser (PDF to text)
Downloads
25
Maintainers
Readme
Page Piranha 🦈
Use LLMs to convert PDFs to text, markdown, or JSON. By default, page piranha uses the Gemini 2.0 Flash model.
Features
- Convert PDFs to plain text, markdown, or JSON
- Support for local files and remote URLs
- Pipe-friendly CLI interface
- Progress indicators and colorful output
- Configurable output directory
- Optional custom prompts for fine-tuned conversions
Installation
npm install page-piranha
Environment Setup
Page Piranha requires Google Cloud Platform credentials to use Vertex AI. Create a .env
file with the following variables:
GCP_PROJECT=your_gcp_project
GCP_LOCATION=your_gcp_location
GOOGLE_APPLICATION_CREDENTIALS=path_to_your_gcp_credentials_file
CLI Usage
Basic usage:
.bin/page-piranha -f input.pdf -m text -o output
Options:
-f, --file <file>
- The PDF file to convert (required)-m, --mode <mode>
- Conversion mode: text, markdown, or json (default: text)-o, --outDir <directory>
- Output directory (default: out)-t, --tee
- Output to both file and stdout-v, --verbose
- Enable verbose logging-p, --prompt <prompt>
- Additional hints for conversion
Examples:
Convert to text
.bin/page-piranha -f document.pdf -m text
Convert to markdown with custom output directory
.bin/page-piranha -f document.pdf -m markdown -o converted
Convert to JSON and pipe to jq
.bin/page-piranha -f assets/demo.pdf -m json -p "Make sure to use camel case. This is an invoice. Feel free to nest fields" -t | jq
Programmatic Usage
Page Piranha can be used programmatically in your TypeScript/JavaScript projects:
import { PagePiranha } from 'page-piranha';
import { JorEl } from 'jorel';
// Initialize
const jorEl = new JorEl({ vertexAi: true });
const piranha = new PagePiranha(jorEl);
// Convert to text
const text = await piranha.toText('document.pdf');
// Convert to markdown with additional prompt
const markdown = await piranha.toMarkdown('document.pdf', 'Focus on headers and lists');
// Convert to JSON
const json = await piranha.toJson('document.pdf');
API Reference
PagePiranha Class
Constructor
constructor(jorEl: JorEl, options?: PagePiranhaOptions)
Methods
toText(fileOrFiles: string | Buffer, additionalPrompt?: string): Promise<string>
toMarkdown(fileOrFiles: string | Buffer, additionalPrompt?: string): Promise<string>
toJson(fileOrFiles: string | Buffer, additionalPrompt?: string): Promise<object>
License
MIT