tesseract_native

v0.5.2

C++ module for node providing OCR with tesseract and leptonica

node-tesseract-native

C++ module for node providing OCR with tesseract and leptonica

Prerequisites

  • Have Linux or OS X (these are the only tested platforms)
  • Have node (>= 0.12.0) and node-gyp installed
  • Have leptonica (~1.68) libs and headers installed
  • Have tesseract (~3.02) libs and headers installed

Build

Check out the repository and build it yourself using

node-gyp configure && node-gyp build

or use npm

npm install tesseract_native

Supported Picture Formats

The module can handle every picture format that leptonica can handle (see the leptonica documentation), but since this module is likely to be used in an online service, pictures should be as small as possible. A 1.3-megapixel picture converted to black and white using adaptive threshold filtering and saved as PNG will be about 50 KB on average. This is what you want to aim for.

Test your setup

You can test your setup using the provided test.js script on the command-line

$ node test.js HelloWorld.jpg

Example server

The code below shows a fully functional server to which you can POST pictures. The response contains the recognized plain text, an empty body if nothing was recognized, or an error message with status 500 if something went wrong.

var tesseract = require('tesseract_native');
var http = require('http');

var server = http.createServer(function(request, response)
{
    if(request.method === 'POST')
    {
        var totalSize = 0;
        var bufferList = [];
        var myOcr = new tesseract.OcrEio();

        // Collect the uploaded picture chunk by chunk.
        request.on('data', function(data) {
            bufferList.push(data);
            totalSize += data.length;
            // Drop the connection once the body exceeds 1 MB.
            if (totalSize > 1e6) {
                console.log('Request body too large');
                request.connection.destroy();
            }
        });

        // The whole picture has arrived: run OCR without blocking the event loop.
        request.on('end', function() {
            var buffer = Buffer.concat(bufferList, totalSize);
            myOcr.ocr(buffer, function(err, result) {
                if(err) {
                    response.writeHead(500, {'Content-Type': 'text/plain'});
                    response.end("Error " + err);
                } else {
                    response.writeHead(200, {'Content-Type': 'text/plain'});
                    response.end(result);
                }
            });
        });

    } else {
        // Only POST is supported; close the connection for anything else.
        request.connection.destroy();
    }
}).listen(process.argv[2]); // the port is taken from the first command-line argument
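
To try the server, a client only has to send the raw picture bytes as the POST body. The following is a minimal sketch using only Node's built-in http module. The helper name postImage, the host 127.0.0.1 and the Content-Type header are assumptions made for this example and are not part of the module.

```javascript
var http = require('http');

// Hypothetical helper (not part of tesseract_native): POST raw picture
// bytes to a locally running OCR server and pass the HTTP status code
// and the response body (the recognized text) to the callback.
function postImage(imageBuffer, port, callback) {
    var request = http.request({
        host: '127.0.0.1',
        port: port,
        method: 'POST',
        path: '/',
        headers: {
            'Content-Type': 'application/octet-stream',
            'Content-Length': imageBuffer.length
        }
    }, function(response) {
        var chunks = [];
        response.on('data', function(chunk) {
            chunks.push(chunk);
        });
        response.on('end', function() {
            callback(null, response.statusCode, Buffer.concat(chunks).toString());
        });
    });

    // Connection problems are reported through the same callback.
    request.on('error', callback);
    request.end(imageBuffer);
}
```

With the server started on port 8080, for instance, calling postImage(require('fs').readFileSync('HelloWorld.jpg'), 8080, function(err, status, text) { ... }) would hand the recognized text to the callback.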

Parameters

The ocr function also accepts a config object as its second parameter, in which case the callback becomes the third:

myOcr.ocr(buffer, { lang:"eng", rect:[0,0,400,400] }, function(err, result) {
    // do something
});

The config object supports the following parameters:

  • tessdata: the path to your Tesseract data directory (/usr/local/share/tessdata/ by default).
  • lang: any three-character code for a language you have installed with Tesseract (eng by default).
  • rect: an array of the form [X, Y, WIDTH, HEIGHT] that limits recognition to a region of the image. If you try the rect above with the provided test image, it should land you in hell...
  • psm: an integer from 0 to 10 that configures the page segmentation mode, as listed in the table below. The default mode is 3.

Value | Meaning
------|-----
0 | Orientation and script detection (OSD) only.
1 | Automatic page segmentation with OSD.
2 | Automatic page segmentation, but no OSD, or OCR.
3 | Fully automatic page segmentation, but no OSD. (Default)
4 | Assume a single column of text of variable sizes.
5 | Assume a single uniform block of vertically aligned text.
6 | Assume a single uniform block of text.
7 | Treat the image as a single text line.
8 | Treat the image as a single word.
9 | Treat the image as a single word in a circle.
10 | Treat the image as a single character.
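
A malformed config object would otherwise only surface inside the native call, so it can help to check it before calling ocr. The sketch below validates the four parameters against the descriptions given here. The function name buildOcrConfig is hypothetical, and the module itself may validate differently or not at all.

```javascript
// Hypothetical helper, not part of tesseract_native: checks a config
// object against the parameter descriptions in this README and returns
// a copy that contains only the documented keys that were supplied.
function buildOcrConfig(options) {
    options = options || {};
    var config = {};

    if (options.tessdata !== undefined) {
        if (typeof options.tessdata !== 'string') {
            throw new TypeError('tessdata must be a path string');
        }
        config.tessdata = options.tessdata;
    }

    if (options.lang !== undefined) {
        // Assumption: only plain three-letter codes such as "eng" are
        // accepted, following the description above.
        if (typeof options.lang !== 'string' || !/^[a-z]{3}$/.test(options.lang)) {
            throw new TypeError('lang must be a three-character code');
        }
        config.lang = options.lang;
    }

    if (options.rect !== undefined) {
        // [X, Y, WIDTH, HEIGHT], all non-negative finite numbers.
        var validRect = Array.isArray(options.rect) &&
            options.rect.length === 4 &&
            options.rect.every(function(value) {
                return typeof value === 'number' && isFinite(value) && value >= 0;
            });
        if (!validRect) {
            throw new TypeError('rect must be [X, Y, WIDTH, HEIGHT]');
        }
        config.rect = options.rect.slice();
    }

    if (options.psm !== undefined) {
        if (typeof options.psm !== 'number') {
            throw new TypeError('psm must be a number');
        }
        // Page segmentation modes are the integers 0 to 10.
        if (Math.floor(options.psm) !== options.psm || options.psm < 0 || options.psm > 10) {
            throw new RangeError('psm must be an integer from 0 to 10');
        }
        config.psm = options.psm;
    }

    return config;
}
```

The result can be passed straight through, for example myOcr.ocr(buffer, buildOcrConfig({ lang: "eng", psm: 7 }), callback).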

Why?

The question may arise. I've seen many tesseract wrappers for node, and none of those I found did it quite right; some of them even did it wrong. The philosophy (and necessity) behind node is not to block, so everything that does work has to do it asynchronously and emit an event or execute a closure when it's done. If you don't do that, your node application will simply not perform well.

But even in this code you can see a very crude solution, performance-wise. The tesseract API is instantiated and initialized on every call to the ocr method. Why didn't I do that when loading the module or when the constructor gets called? There are several reasons:

  • Simplicity: Initializing tesseract involves file system access, which means it must be performed asynchronously. The OCR work is done by passing a function to the asynchronous uv_queue_work; for simplicity, I bundled all time-consuming tasks into that function. So even though initializing on every call wastes some cycles, it is still perfectly non-blocking.

  • Flexibility: The language setting is passed on initialization, so by initializing on each request, the language used for detection can be set with each request.

  • Robustness: The tesseract context may not be thread-safe. There are hints in the tesseract code that suggest that. I will look further into it and I will be experimenting with a version that initializes the tesseract context at load time.
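
Because the language is passed with every call, a server can choose it per request. Here is a minimal sketch, assuming the client names the language in a lang query parameter such as POST /?lang=deu. That parameter and the helper name configFromRequestUrl are inventions of this example, not something the module defines.

```javascript
// Hypothetical helper: derive the per-request OCR config from the request
// URL. Anything that is not a plain three-letter code falls back to "eng",
// the default language named in this README.
function configFromRequestUrl(requestUrl) {
    // request.url is only a path, so a dummy base is needed to parse it.
    var parsed = new URL(requestUrl, 'http://localhost');
    var lang = parsed.searchParams.get('lang');

    if (typeof lang !== 'string' || !/^[a-z]{3}$/.test(lang)) {
        lang = 'eng';
    }
    return { lang: lang };
}
```

In the example server, the call inside the 'end' handler would then become myOcr.ocr(buffer, configFromRequestUrl(request.url), function(err, result) { ... }). Restricting the value to three letters also keeps arbitrary client input from reaching the native code.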