node-ts-ocr

v1.0.15

A simple wrapper around command-line utils to assist in PDF / Image OCR (Optical Character Recognition) processing using Tesseract.

Downloads

643

Node Typescript OCR

Installation

npm install node-ts-ocr --save

Dependencies

After installing node-ts-ocr, the following binaries need to be installed on your system and available on your PATH.

PDF To Text & PDF Info

Many PDFs already have plain text embedded in them, either because they were born-digital (i.e. created from a word processing document) or because OCR was already performed on them. If we are able to extract the text using this utility, we do not need to perform image conversion and subsequent OCR.

OSX

pdftotext & pdfinfo are included as part of the xpdf utilities library.

brew install xpdf

Ubuntu

pdftotext & pdfinfo are included in the poppler-utils library.

sudo apt-get install poppler-utils

CLI Example

Attempt to extract the text from a PDF:

pdftotext /path/to/document.pdf output.txt
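The decision described at the start of this section, whether the text pulled out by pdftotext is good enough to skip image conversion and OCR, can be sketched as a small check. This is only an illustration: the helper name `hasEmbeddedText` and the threshold of 10 visible characters are assumptions, not part of node-ts-ocr.

```typescript
// Hypothetical helper (not part of node-ts-ocr): decides whether the output
// of pdftotext contains enough visible text to skip the OCR fallback.
// The default threshold of 10 characters is an arbitrary assumption.
function hasEmbeddedText(pdfToTextOutput: string, minChars: number = 10): boolean {
	// pdftotext separates pages with form feeds (\f); \s matches those too,
	// so a scanned PDF that yields only page breaks counts as empty.
	const visible = pdfToTextOutput.replace(/\s+/g, '');
	return visible.length >= minChars;
}
```

If the check fails, the document most likely consists of scanned images and needs the conversion and OCR steps described in the following sections.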

ImageMagick & Ghostscript

A PDF is a jumble of instructions for how to render a document on a screen or page. Although it may contain images, a PDF is not itself an image, and therefore we can't perform OCR on it directly. To convert PDFs to images, we use ImageMagick's convert command, which depends on Ghostscript.

OSX

brew install imagemagick
brew install gs

Ubuntu

sudo apt-get update
sudo apt-get install imagemagick --fix-missing
sudo apt-get install ghostscript

CLI Example

Convert a PDF to a TIFF representation:

convert -density 300 /path/to/document.pdf -depth 8 -strip -background white -alpha off image.tiff
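When calling convert from code, it can help to build the argument list explicitly so that each flag stays in the position ImageMagick expects. The sketch below reproduces the command above as an array suitable for a process-spawning call; `buildConvertArgs` is a hypothetical helper, not something node-ts-ocr exports.

```typescript
// Hypothetical helper (not part of node-ts-ocr): builds the argument list
// for the ImageMagick command shown above.
function buildConvertArgs(pdfPath: string, tiffPath: string, density: number = 300): string[] {
	return [
		// Input resolution; it must come before the input file to affect
		// how the PDF is rasterized.
		'-density', String(density),
		pdfPath,
		// 8 bits per channel.
		'-depth', '8',
		// Drop profiles and comments from the output.
		'-strip',
		// Set a white background and turn the alpha channel off.
		'-background', 'white',
		'-alpha', 'off',
		tiffPath,
	];
}
```

Passing the arguments as an array instead of a single shell string avoids quoting problems with paths that contain spaces.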

Tesseract

Tesseract is an open source OCR engine.

OSX

brew install tesseract

Ubuntu

sudo apt-get install tesseract-ocr

CLI Example

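Once we have a TIFF representation of the document, we can use Tesseract to (attempt to) extract the plain text. Tesseract treats its second argument as an output base name and appends .txt to it, so the following command writes output.txt:

tesseract image.tiff output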
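Tesseract's second positional argument is an output base name, and the engine appends `.txt` to it when writing plain text. A caller therefore needs to keep track of both the base name it passes in and the file it later reads. The sketch below illustrates this; `buildTesseractCommand` and `TesseractCommand` are hypothetical names, not part of node-ts-ocr.

```typescript
// Hypothetical types and helper (not part of node-ts-ocr).
interface TesseractCommand {
	// Arguments to pass to the tesseract binary.
	args: string[];
	// File that tesseract will write the recognized text to.
	textFile: string;
}

function buildTesseractCommand(imagePath: string, outputBase: string, lang: string = 'eng'): TesseractCommand {
	return {
		// Usage: tesseract <image> <outputbase> [options]
		args: [imagePath, outputBase, '-l', lang],
		// Tesseract appends ".txt" to the output base for plain text output.
		textFile: `${outputBase}.txt`,
	};
}
```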

Usage

import { Ocr } from 'node-ts-ocr';
import * as path from 'path';

export async function getPdfText(fileName: string): Promise<string> {
	// Assuming your file resides in a directory named sample
	const relativePath = path.join('sample', fileName);
	const filePath = path.join(__dirname, relativePath);
	// Extract the text and return the result
	return await Ocr.extractText(filePath);
}

Methods

extractInfo(filePath: string)

Retrieves the PDF info using the pdfinfo binary and parses the result into a key-value object.
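To make the idea of parsing pdfinfo output into a key-value object concrete, here is a minimal sketch. It assumes the usual `Key: value` line layout that pdfinfo prints; `parsePdfInfo` is a hypothetical function and may differ from how node-ts-ocr does the parsing internally.

```typescript
// Hypothetical parser (not the library's implementation): turns the
// "Key: value" lines printed by pdfinfo into a plain object.
function parsePdfInfo(stdout: string): Record<string, string> {
	const info: Record<string, string> = {};
	for (const line of stdout.split(/\r?\n/)) {
		// Split on the first colon only, so values that contain colons
		// (such as timestamps) are kept whole.
		const separator = line.indexOf(':');
		if (separator === -1) continue;
		const key = line.slice(0, separator).trim();
		const value = line.slice(separator + 1).trim();
		if (key.length > 0) info[key] = value;
	}
	return info;
}
```

All values stay strings in this sketch; a caller that needs the page count as a number would convert `info['Pages']` itself.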

extractText(filePath: string, options?: ExtractTextOptions)

Extracts the text from the PDF using the pdftotext binary.

invokePdfToTiff(outDir: string, filePath: string, options?: ExtractTextOptions)

Converts a PDF file to its TIFF representation using the convert binary.

invokeImageOcr(outDir: string, imagePath: string, options?: ExtractTextOptions)

Performs OCR on an image to extract its text using the tesseract binary.

Options

ExtractTextOptions

Each field holds key-value pairs that are passed as command-line arguments to the respective binary, so the keys must be valid options for that binary.

ExtractTextOptions {
  pdfToTextArgs?: KeyValue;
  convertArgs?: KeyValue;
  tesseractArgs?: KeyValue;
}

Example pdfToTextArgs that only includes pages 1 to 4.

Note: this will only work if you already have a searchable PDF, because the pdftotext binary can only be used to extract text from a searchable PDF.

{ pdfToTextArgs: { f: 1, l: 4 } }

Example convertArgs that sets the convert density to 600 and enables the trim option.

{ convertArgs: { density: '600', trim: '' } }

Example tesseractArgs that sets the language to English, the page segmentation mode to 6, and preserves interword spaces.

{ tesseractArgs: { 'l': 'eng', '-psm': 6, 'c': 'preserve_interword_spaces=1' } }
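How these key-value pairs become command-line arguments is not spelled out here. One plausible reading of the examples, offered as an assumption and not as the package's actual implementation, is that each key is prefixed with a single dash and followed by its value unless the value is an empty string. Under that reading, the `'-psm'` key would come out as `--psm`. The `KeyValue` type below is a stand-in defined for this sketch, and `toCliArgs` is a hypothetical helper.

```typescript
// Stand-in for the library's KeyValue type; its real definition is assumed.
type KeyValue = Record<string, string | number>;

// Hypothetical mapping (an assumption, not the library's code): prefix each
// key with "-" and append the value unless it is an empty string.
function toCliArgs(options: KeyValue): string[] {
	const args: string[] = [];
	for (const key of Object.keys(options)) {
		args.push(`-${key}`);
		const value = String(options[key]);
		// An empty string marks a flag that takes no value, e.g. trim.
		if (value !== '') args.push(value);
	}
	return args;
}
```

Applied to the three examples above, this mapping would yield `-f 1 -l 4`, `-density 600 -trim`, and `-l eng --psm 6 -c preserve_interword_spaces=1`.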

Docker

Coming Soon...