pdfmatch

v1.2.1

Match text in images with Tesseract

PDF Match

Convert images to PDFs with Tesseract, extract PDF text with pdftotext and execute commands depending on the content. The purpose is to scan paperwork, generate a PDF with text overlay and apply a set of rules to rename and move the PDF.

Usage

The pdfmatch command is used like this:

pdfmatch [options] source.{pdf,jpeg,tif,...} [target.pdf]

  Options:
    --config  Use the given config file
    --delete  Remove source file if match was found and command executed
     --debug  Don't create the PDF or execute commands, but print the text
          -l  Use the given language(s), overrides the configured "lang"

If no config file is specified, pdfmatch will look for a file named pdfmatch.json in the current directory.

If the source file is a PDF, the text is extracted with pdftotext and the configured rules are applied.

If the source file is not a PDF, it is expected to be an image and is converted to a PDF with searchable text using tesseract. If no target file is given, the base name of the image is used for the PDF. In a second step, the text is extracted with pdftotext and the configured rules are applied.
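The two paths described above can be sketched as a small planning function. This is only an illustration, not pdfmatch's own code: planSteps and its return shape are hypothetical, and it assumes the generated PDF is written next to the image. The tesseract and pdftotext argument forms are the standard command-line ones.

```javascript
// Sketch only: plans the external commands for one source file.
// planSteps and its return shape are hypothetical, not part of pdfmatch.
function planSteps(source, target, lang) {
  const isPdf = /\.pdf$/i.test(source);
  const steps = [];
  let pdf = source;
  if (!isPdf) {
    // Default target: the image path with its extension stripped.
    const base = (target || source).replace(/\.[^./]+$/, "");
    // tesseract appends ".pdf" to the output base when the "pdf" config is given.
    const args = [source, base];
    if (lang) args.push("-l", lang);
    args.push("pdf");
    steps.push(["tesseract", ...args]);
    pdf = base + ".pdf";
  }
  // "-" makes pdftotext write the extracted text to stdout.
  steps.push(["pdftotext", pdf, "-"]);
  return { pdf, steps };
}

console.log(planSteps("scan.tif", undefined, "deu+eng").steps);
```

For a PDF source the plan contains only the pdftotext step; for an image it contains the tesseract step followed by pdftotext on the generated PDF.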

The configuration file can specify the language(s) to use with tesseract and a set of rules to apply. Rules are tried in order; after the first match, the associated command is executed and processing stops. If no rule matches, the no-match command is executed.

Here is an example:

{
  "rules": [{
    "match": [{
      "invoiceDate": "Invoice Date: ${DATE}"
    }, {
      "invoiceDate": "Ausstellungsdatum: ${DATE}"
    }],
    "command": "mv ${file} ${invoiceDate.format('YYYY-MM-DD')}\\ invoice.pdf"
  }],
  "no-match": "mv ${file} ${now.format('YYYY-MM-DD_HHmmss')}.pdf"
}

The configuration properties are:

  • lang: The language(s) to pass to Tesseract
  • rules: An array of rules to run, where each rule is an object with these properties:
    • match: A single match object or an array of match objects, passed to text-match
    • command: The command to execute, after substituting any JavaScript expressions
  • no-match: A default command to execute if no matching rule was found
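The first-match-wins behaviour can be sketched roughly as follows. Both functions are hypothetical: matchText is a simplified stand-in for text-match that only checks for verbatim substrings (it does not handle placeholders such as ${DATE}), and the sketch assumes that an array of match objects is treated as a list of alternatives.

```javascript
// Hypothetical stand-in for text-match: every pattern must occur verbatim.
// Placeholders such as ${DATE} are NOT supported in this sketch.
function matchText(text, pattern) {
  const found = {};
  for (const [key, needle] of Object.entries(pattern)) {
    if (!text.includes(needle)) return null;
    found[key] = needle;
  }
  return found;
}

// Returns the command of the first matching rule, else the no-match command.
function selectCommand(text, config) {
  for (const rule of config.rules || []) {
    // "match" may be a single object or an array of objects.
    const patterns = Array.isArray(rule.match) ? rule.match : [rule.match];
    for (const pattern of patterns) {
      const props = matchText(text, pattern);
      if (props) return { command: rule.command, props };
    }
  }
  const fallback = config["no-match"];
  return fallback ? { command: fallback, props: {} } : null;
}

const config = {
  rules: [{ match: { company: "npm, Inc" }, command: "echo npm" }],
  "no-match": "echo unknown",
};
console.log(selectCommand("Invoice from npm, Inc", config).command);
console.log(selectCommand("Something else", config).command);
```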

The command can contain variables in the form ${...} where ... is a JavaScript expression with access to the matched properties. After successful substitution, the command is written to the console and executed using child_process.execSync(command).

These special properties can be accessed in commands:

  • file: The PDF file
  • now: The current date as a moment object
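To illustrate how ${...} expressions might be expanded, here is a minimal sketch. It is not pdfmatch's actual implementation: substitute is a hypothetical helper, and fakeDate is a stand-in for a moment object that ignores the format pattern and returns a fixed string.

```javascript
// Sketch: evaluates each ${...} expression with the given properties in scope.
// substitute() is a hypothetical helper, not pdfmatch's actual code.
function substitute(command, props) {
  const names = Object.keys(props);
  const values = names.map((name) => props[name]);
  return command.replace(/\$\{([^}]+)\}/g, (_, expr) => {
    // Each property becomes a parameter of a freshly built function.
    const evaluate = new Function(...names, `return (${expr});`);
    return String(evaluate(...values));
  });
}

// Stand-in for a moment object; always returns the same date string.
const fakeDate = { format: () => "2016-03-07" };

const command = substitute(
  "mv ${file} ${invoiceDate.format('YYYY-MM-DD')}\\ invoice.pdf",
  { file: "scan.pdf", invoiceDate: fakeDate }
);
console.log(command); // prints: mv scan.pdf 2016-03-07\ invoice.pdf
```

Because the resulting string is handed to a shell, values containing spaces or shell metacharacters need escaping in the command template, as the backslash before " invoice.pdf" does here.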

Install

Installing Tesseract with brew:

$ brew install tesseract --with-all-languages

Installing pdftotext (or download from http://www.foolabs.com/xpdf/download.html):

$ brew install Caskroom/cask/pdftotext

Installing this tool:

$ npm install pdfmatch -g

Example setup

My working setup is a ~/Documents/Scans folder containing only my pdfmatch.json configuration. The commands in the rules move the matched files one level up:

{
  "lang": "deu+eng",
  "rules": [{
    "match": {
      "company": "npm, Inc",
      "invoiceDate": "${DATE}"
    },
    "command": "mv ${file} ../${invoiceDate.format('YYYY-MM-DD')}\\ npm.pdf"
  }],
  "no-match": "mv ${file} ../${now.format('YYYY-MM-DD_HHmmss')}.pdf"
}

AppleScript folder action

If you're following the above example setup, there is an AppleScript folder action in ./scripts which allows you to save or drop files in a special folder and have pdfmatch invoked automatically. Follow the instructions in the header comments on how to use it.

API

This module also exposes an API when required as a Node module:

  • processText(pdf_file, config, callback): Extracts text from a PDF file and applies the rules from the given configuration (see above).
  • processImage(image_file, pdf_file, config, callback): Converts an image to a PDF file and then calls processText with the result.

License

MIT