groq-ocr

v1.0.6

Published

4 months ago

A library to run OCR with Groq-provided models.

jrada

AI LLMs GROQ OPEN SOURCE

Disclaimer

This project is still in development‼️

Multi-page PDF support is experimental and work in progress.

PDF support relies on the pdftopic library, which requires Node.js >= 12 and ImageMagick.

JSON mode might fail with a json_validate_failed error.
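One way to cope with the JSON mode failure is to retry in markdown mode. The sketch below is hypothetical and not part of groq-ocr: `OcrFn` and `ocrWithJsonFallback` are illustrative names, and it assumes that the OCR call resolves to a string and that the thrown error message contains the text json_validate_failed.

```typescript
// Hypothetical helper, not part of groq-ocr. OcrFn stands in for a wrapper
// around the library's ocr() call; the string return type is an assumption.
type OcrFn = (opts: { filePath: string; jsonMode?: boolean }) => Promise<string>;

interface OcrOutcome {
  mode: "json" | "markdown";
  text: string;
}

async function ocrWithJsonFallback(run: OcrFn, filePath: string): Promise<OcrOutcome> {
  try {
    // First attempt: ask for structured JSON.
    return { mode: "json", text: await run({ filePath, jsonMode: true }) };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    // Assumption: the failure surfaces with this marker in the error message.
    if (!message.includes("json_validate_failed")) throw err;
    // Second attempt: fall back to the default markdown output.
    return { mode: "markdown", text: await run({ filePath, jsonMode: false }) };
  }
}
```

With the real package, `run` could be a thin wrapper such as `(opts) => ocr({ ...opts, apiKey })`. Errors unrelated to JSON validation are rethrown unchanged.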

Installation

npm i groq-ocr to use as an NPM package.

npm i -g groq-ocr to use as a CLI.

Usage

Use as NPM package:

import { ocr, GroqVisionModel } from "groq-ocr";
const result = await ocr({
  filePath: "./filepath.jpg", // Allowed formats: jpg, jpeg, png, pdf.
  apiKey: process.env.GROQ_API_KEY, // Get your API key from https://console.groq.com/
  model: GroqVisionModel.LLAMA_32_90B, // available models: LLAMA_32_11B, LLAMA_32_90B. Default: LLAMA_32_11B
  jsonMode: false, // Default: false. Set to true to get JSON output.
  additionalInstructions: "Additional instructions to be included in the prompt.", // Use to give custom instructions to the model.
});

ocr options:

filePath (required): Path to image/PDF file or URL
- Supported formats: .jpg, .jpeg, .png, .pdf
apiKey (optional): Groq API key
- Defaults to GROQ_API_KEY environment variable
model (optional): Vision model to use
- GroqVisionModel.LLAMA_32_11B (default) - Llama 3.2 11B Vision Preview
- GroqVisionModel.LLAMA_32_90B - Llama 3.2 90B Vision Preview
jsonMode (optional): Return structured JSON instead of markdown
- Defaults to false
additionalInstructions (optional): Additional instructions to be included in the prompt.
- Defaults to "" - use to give custom instructions to the model.
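The defaults listed above can be sketched as a small resolver. This is a reconstruction from the option list, not the library's actual type definitions: `OcrOptions`, `ResolvedOcrOptions`, and `resolveOcrOptions` are hypothetical names, and the environment is passed in as a plain object to keep the function easy to test.

```typescript
// Model identifiers as documented in this README.
enum GroqVisionModel {
  LLAMA_32_11B = "llama-3.2-11b-vision-preview",
  LLAMA_32_90B = "llama-3.2-90b-vision-preview",
}

// Hypothetical option shape reconstructed from the list above.
interface OcrOptions {
  filePath: string;
  apiKey?: string;
  model?: GroqVisionModel;
  jsonMode?: boolean;
  additionalInstructions?: string;
}

type ResolvedOcrOptions = Required<OcrOptions>;

// Apply the documented defaults; env would typically be process.env.
function resolveOcrOptions(
  options: OcrOptions,
  env: Record<string, string | undefined>,
): ResolvedOcrOptions {
  const apiKey = options.apiKey ?? env.GROQ_API_KEY;
  if (!apiKey) {
    throw new Error("Missing API key: pass apiKey or set GROQ_API_KEY");
  }
  return {
    filePath: options.filePath,
    apiKey,
    model: options.model ?? GroqVisionModel.LLAMA_32_11B,
    jsonMode: options.jsonMode ?? false,
    additionalInstructions: options.additionalInstructions ?? "",
  };
}
```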

Use as CLI:

Either set your Groq API key as an environment variable:

export GROQ_API_KEY=your-api-key

Or provide it as a CLI option with the -k flag when running commands.

CLI Examples

# Basic usage
groq-ocr -f image.jpg

# Output as JSON
groq-ocr -f scan.pdf -j

# Save to file
groq-ocr -f receipt.png -o result.txt

# Use specific model and API key
groq-ocr -f document.jpg -m llama-3.2-90b-vision-preview -k your-api-key

CLI Options

-f, --file <path> (required): Path to input image/PDF file
-k, --api-key <key>: Groq API key (defaults to GROQ_API_KEY env var)
-m, --model <model>: Vision model to use:
- llama-3.2-11b-vision-preview (default)
- llama-3.2-90b-vision-preview
-j, --json: Output in JSON format instead of markdown
-o, --output <path>: Write result to file instead of console
-V, --version: Display version number
-h, --help: Display help information
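When calling the CLI from a script, the flags above can be assembled programmatically. The following is a minimal sketch using only the documented short flags; `CliRequest` and `buildGroqOcrArgs` are hypothetical names, not something the package exports.

```typescript
// Hypothetical request shape mirroring the documented CLI options.
interface CliRequest {
  file: string;     // -f, --file (required)
  apiKey?: string;  // -k, --api-key
  model?: string;   // -m, --model
  json?: boolean;   // -j, --json
  output?: string;  // -o, --output
}

// Build the argument list for the groq-ocr command.
function buildGroqOcrArgs(req: CliRequest): string[] {
  const args = ["-f", req.file];
  if (req.apiKey) args.push("-k", req.apiKey);
  if (req.model) args.push("-m", req.model);
  if (req.json) args.push("-j");
  if (req.output) args.push("-o", req.output);
  return args;
}
```

The resulting array can be handed to `execFile("groq-ocr", args)` from `node:child_process`, which avoids shell quoting issues with file paths.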

How it works

This library and CLI use multimodal models with vision capabilities provided by Groq to run OCR on images and PDFs and return markdown or JSON.

PDFs are converted to images using pdftopic.
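The library's internal request code is not shown in this README. As an illustration only, the sketch below shows how an image could be packaged for a vision model, assuming an OpenAI-compatible chat payload with an `image_url` content part. `toDataUrl` and `buildVisionRequest` are hypothetical helpers, and the prompt text is up to the caller.

```typescript
// MIME types for the documented image formats (PDFs are converted to images first).
const MIME_BY_EXTENSION: Record<string, string> = {
  jpg: "image/jpeg",
  jpeg: "image/jpeg",
  png: "image/png",
};

// Wrap already base64-encoded image bytes in a data URL.
function toDataUrl(fileName: string, base64: string): string {
  const ext = fileName.split(".").pop()?.toLowerCase() ?? "";
  const mime = MIME_BY_EXTENSION[ext];
  if (!mime) throw new Error(`Unsupported image extension: ${ext}`);
  return `data:${mime};base64,${base64}`;
}

// Assumed payload shape: one user message carrying the prompt and the image.
function buildVisionRequest(model: string, prompt: string, imageUrl: string) {
  return {
    model,
    messages: [
      {
        role: "user" as const,
        content: [
          { type: "text" as const, text: prompt },
          { type: "image_url" as const, image_url: { url: imageUrl } },
        ],
      },
    ],
  };
}
```

Remote images could pass their URL directly as `imageUrl`, while local files would be read, base64-encoded, and passed through `toDataUrl` first.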

Models

The plan is to support all models with vision capabilities provided by Groq (see the Groq vision models documentation).

Currently supported models:

enum GroqVisionModel {
  LLAMA_32_11B = "llama-3.2-11b-vision-preview",
  LLAMA_32_90B = "llama-3.2-90b-vision-preview",
}
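Because the CLI takes the model as a plain string, it can help to validate that string against the enum before use. The helper below is a hypothetical sketch (`parseGroqVisionModel` is not exported by the package) built on the enum values listed above.

```typescript
enum GroqVisionModel {
  LLAMA_32_11B = "llama-3.2-11b-vision-preview",
  LLAMA_32_90B = "llama-3.2-90b-vision-preview",
}

// Convert a raw model string into the enum, rejecting unsupported names.
function parseGroqVisionModel(value: string): GroqVisionModel {
  const supported = Object.values(GroqVisionModel) as string[];
  if (!supported.includes(value)) {
    throw new Error(
      `Unsupported model "${value}". Supported: ${supported.join(", ")}`,
    );
  }
  return value as GroqVisionModel;
}
```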

Roadmap

[x] Add support for local images OCR
[x] Add support for remote images OCR
[x] Add support for single page PDFs
[x] Add support for JSON output in addition to markdown
[x] Add CLI
[x] Extend prompt with custom instructions
[ ] Add support for multi-page PDF OCR (available but experimental)

Credit

This project was heavily inspired by llama-ocr.
