pdftojson

v0.0.3

pdftotext wrapper that generates JSON with bounding box data. Takes care of duplicate characters.

Downloads

813

Readme

pdftojson

pdftojson is a pdftotext wrapper that generates JSON with bounding box data. It takes care of overlapping, duplicated characters, which often exist in PDF files generated by MS Word that contain floating images and text.

Why bother with a wrapper for pdftotext?

Consider this PDF file:

PDF sample

pdftotext -bbox theFile.pdf would generate this:

...
<word xMin="103.320000" yMin="547.355700" xMax="152.368008" yMax="561.321720">(6)綠線</word>
<word xMin="155.880000" yMin="547.355700" xMax="176.846541" yMax="561.321720">G01</word>
<word xMin="155.880000" yMin="547.355700" xMax="162.867200" yMax="561.321720">G</word>
<word xMin="180.300000" yMin="547.355700" xMax="222.295867" yMax="561.321720">站延伸</word>
<word xMin="208.080000" yMin="547.355700" xMax="264.053062" yMax="561.321720">伸至大溪</word>
<word xMin="264.480000" yMin="547.355700" xMax="334.420485" yMax="561.321720">、龍潭先進</word>
<word xMin="320.340000" yMin="547.355700" xMax="348.294390" yMax="561.321720">進公</word>
<word xMin="124.680000" yMin="572.375700" xMax="166.675867" yMax="586.341720">共運輸</word>
<word xMin="152.700000" yMin="572.375700" xMax="222.644667" yMax="586.341720">輸系統發展</word>
<word xMin="208.440000" yMin="572.375700" xMax="278.395867" yMax="586.341720">展委託可行</word>
<word xMin="264.840000" yMin="572.375700" xMax="320.813062" yMax="586.341720">行性研究</word>
...

pdftotext does a great job of "undoing" the physical layout (columns, hyphenation, etc.) of a PDF document. However, its output contains some overlapping and duplicated words. PDF layout engines sometimes produce these quirks when images and text are mixed on a page.

By contrast, pdftojson theFile.pdf would generate this:

...
{
    "xMin": 103.2,
    "xMax": 348.29439,
    "yMin": 547.3557,
    "yMax": 561.32172,
    "text": "(6)綠線 G01 站延伸至大溪、龍潭先進公"
},
{
    "xMin": 124.68,
    "xMax": 320.813062,
    "yMin": 572.3757,
    "yMax": 586.34172,
    "text": "共運輸系統發展委託可行性研究"
}
...
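
The README does not spell out how the duplicates are resolved. As a rough illustration only, and not necessarily what pdftojson does internally, the sketch below merges the words of one line by estimating each word's average character width and dropping the leading characters that overlap text already emitted. The `mergeLine` name and the 1 pt `spaceGap` threshold are assumptions made for this example, and the input must be a non-empty list of words from the same line.

```javascript
// Hypothetical sketch: merge the <word> boxes of a single line into one entry.
// Assumes all words sit on the same line and characters are evenly spaced.
function mergeLine(words, spaceGap = 1) {
  // Left to right; on ties, the wider box first so contained duplicates come later.
  const sorted = [...words].sort((a, b) => a.xMin - b.xMin || b.xMax - a.xMax);
  let text = '';
  let xMax = -Infinity; // right edge of the text emitted so far
  for (const word of sorted) {
    const chars = Array.from(word.text);
    const charWidth = (word.xMax - word.xMin) / chars.length;
    const overlap = Math.min(xMax, word.xMax) - word.xMin;
    let kept = chars;
    if (overlap > 0) {
      // Drop as many leading characters as fit into the overlapping width.
      kept = chars.slice(Math.round(overlap / charWidth));
    } else if (text !== '' && word.xMin - xMax > spaceGap) {
      // A visible gap between boxes becomes a single space.
      text += ' ';
    }
    text += kept.join('');
    xMax = Math.max(xMax, word.xMax);
  }
  return {
    xMin: sorted[0].xMin,
    xMax,
    yMin: Math.min(...sorted.map((w) => w.yMin)),
    yMax: Math.max(...sorted.map((w) => w.yMax)),
    text,
  };
}

// The second line of the pdftotext sample shown earlier.
const merged = mergeLine([
  { xMin: 124.68, yMin: 572.3757, xMax: 166.675867, yMax: 586.34172, text: '共運輸' },
  { xMin: 152.7, yMin: 572.3757, xMax: 222.644667, yMax: 586.34172, text: '輸系統發展' },
  { xMin: 208.44, yMin: 572.3757, xMax: 278.395867, yMax: 586.34172, text: '展委託可行' },
  { xMin: 264.84, yMin: 572.3757, xMax: 320.813062, yMax: 586.34172, text: '行性研究' },
]);
console.log(merged.text); // 共運輸系統發展委託可行性研究
```

The character-width estimate is crude: it would likely misjudge overlaps in words that mix narrow Latin letters with wide CJK characters, so treat it as a way to reason about the problem rather than a replacement for the tool.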

Install

$ npm install pdftojson

pdftojson uses pdftotext. Please make sure pdftotext is available in PATH.

Usage

pdftojson is available as a command line tool and a Node.js library.

CLI

# outputs some.json
$ pdftojson some.pdf

# converts pages 3 to 6 of some.pdf and outputs some.json
$ pdftojson -c "-f 3 -l 6" some.pdf

NodeJS Library

The library exposes a single function that takes the name of a PDF file and returns a promise.

import pdftojson from 'pdftojson';

pdftojson("./some.pdf").then((output) => {
  // output is a JavaScript object (see "Output format").
});
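
The same call can also be written with async/await. The sketch below is only an illustration: `countWordsPerPage` is a hypothetical helper, and it takes the converter as a parameter so that it can be exercised with a stand-in function when pdftotext is not installed. It assumes the resolved value follows the structure described under "Output format".

```javascript
// Hypothetical helper: resolve to the number of words found on each page.
// `convert` is expected to behave like the function exported by pdftojson:
// it takes a PDF file name and returns a promise of an array of pages.
async function countWordsPerPage(convert, pdfPath) {
  const pages = await convert(pdfPath);
  return pages.map((page) => page.words.length);
}

// Possible usage, assuming pdftojson is installed and pdftotext is in PATH:
//   import pdftojson from 'pdftojson';
//   countWordsPerPage(pdftojson, './some.pdf')
//     .then((counts) => console.log(counts))
//     .catch((err) => console.error('conversion failed:', err));
```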

Output format

All numeric values are in points (pt).

[
  { //: Page
    width: (Number) page width,
    height: (Number) page height,
    words: [
      {
        text: (String) the text enclosed in the bounding box,

        // All coordinates calculated from top-left corner of the page
        xMin: (Number) left edge of the bounding box,
        xMax: (Number) right edge of the bounding box,
        yMin: (Number) top edge of the bounding box,
        yMax: (Number) bottom edge of the bounding box
      }, // ...
    ]
  }, // ...
]
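
To show how this structure might be consumed, here is a small sketch that searches every page for entries whose text contains a given string. `findWords` is a hypothetical helper, not part of pdftojson, and the sample data is made up to match the documented shape.

```javascript
// Hypothetical helper: return every bounding box whose text contains `needle`,
// tagged with a 1-based page number. `pages` follows the output format above.
function findWords(pages, needle) {
  const hits = [];
  pages.forEach((page, index) => {
    for (const word of page.words) {
      if (word.text.includes(needle)) {
        hits.push({ page: index + 1, ...word });
      }
    }
  });
  return hits;
}

// Made-up sample in the documented shape (values are illustrative only).
const sample = [
  {
    width: 595,
    height: 842,
    words: [
      { text: 'hello world', xMin: 72, xMax: 130, yMin: 72, yMax: 84 },
      { text: 'goodbye', xMin: 72, xMax: 110, yMin: 90, yMax: 102 },
    ],
  },
];
console.log(findWords(sample, 'world'));
```

Because coordinates are measured from the top-left corner of the page, a smaller `yMin` means the entry sits higher on the page.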