@avodah-engineering/pdf-text-reader

v5.1.1

Dead simple pdf text reader

PDF Text Reader

Dead simple PDF text reader for Node.js. Uses Mozilla's pdfjs-dist package.

Requires ESM and Node.js v22 or greater. (These requirements come from Mozilla's pdfjs-dist package itself.)

Install

npm install pdf-text-reader

Usage

  • Read all pages into a single string with readPdfText:

    import {readPdfText} from 'pdf-text-reader';
    
    async function main() {
        const pdfText: string = await readPdfText({url: 'path/to/pdf/file.pdf'});
        console.info(pdfText);
    }
    
    main();

  • Read a PDF into individual pages with readPdfPages:

    import {readPdfPages} from 'pdf-text-reader';
    
    async function main() {
        const pages = await readPdfPages({url: 'path/to/pdf/file.pdf'});
        console.info(pages[0]?.lines);
    }
    
    main();

See the types for detailed argument and return value types.

Details

This package simply reads the output of pdfjs.getDocument and sorts it into lines based on text position in the document. It also inserts spaces between pieces of text on the same line that are far apart horizontally, and newlines between lines that are far apart vertically.

Example:

The text below, when found in a PDF, will be read with spaces between the cells even if the PDF itself contains no space characters there.

cell 1               cell 2                 cell 3

The number of spaces to insert is calculated by an extremely naive but very simple calculation of Math.ceil(distance-between-text/text-height).
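The idea described above can be sketched roughly as follows. This is only an illustration, not the package's actual source: the PositionedText shape, the itemsToLines and spacesBetween names, and the "half the text height" rule for deciding that two items share a line are all assumptions made for the example. Only the Math.ceil(distance / text-height) formula comes from the description above. Inserting extra newlines for large vertical gaps is left out to keep the sketch short.

```typescript
// Hypothetical item shape for illustration only. pdfjs-dist's real TextItem
// stores its position inside a `transform` matrix instead.
interface PositionedText {
    text: string;
    x: number; // left edge of the text
    y: number; // baseline; larger values are higher on the page
    width: number;
    height: number;
}

// The calculation quoted above: Math.ceil(distance-between-text / text-height).
// Non-positive inputs yield zero spaces (an assumption of this sketch).
function spacesBetween(distance: number, textHeight: number): number {
    if (distance <= 0 || textHeight <= 0) {
        return 0;
    }
    return Math.ceil(distance / textHeight);
}

// Group items into lines by vertical position, then join each line left to right.
function itemsToLines(items: PositionedText[]): string[] {
    // Top of the page first, then left to right.
    const sorted = [...items].sort((a, b) => b.y - a.y || a.x - b.x);
    const lines: PositionedText[][] = [];
    for (const item of sorted) {
        const current = lines[lines.length - 1];
        // Assumption: baselines closer than half the text height share a line.
        if (current && Math.abs(current[0].y - item.y) < item.height / 2) {
            current.push(item);
        } else {
            lines.push([item]);
        }
    }
    return lines.map((line) => {
        // Re-sort by x because items on one line may have slightly different y values.
        line.sort((a, b) => a.x - b.x);
        let result = '';
        let previous: PositionedText | undefined;
        for (const item of line) {
            if (previous) {
                const gap = item.x - (previous.x + previous.width);
                result += ' '.repeat(spacesBetween(gap, item.height));
            }
            result += item.text;
            previous = item;
        }
        return result;
    });
}
```

With this sketch, two items 25 units apart with a text height of 10 would be joined by Math.ceil(25 / 10) = 3 spaces, which shows why the spacing is only a rough approximation of the visual gap.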

Low Level Control

If you need lower-level parsing control, you can also use the exported parsePageItems function. It parses only one page at a time, as seen below. readPdfPages uses this function internally, so the output will be identical for the same PDF page.

You may need to independently install the pdfjs-dist npm package for this to work.

import * as pdfjs from 'pdfjs-dist';
import type {TextItem} from 'pdfjs-dist/types/src/display/api';
import {parsePageItems} from 'pdf-text-reader';

async function main() {
    // getDocument returns a loading task; its promise resolves to the loaded document.
    const doc = await pdfjs.getDocument('myDocument.pdf').promise;
    const page = await doc.getPage(1); // 1-indexed
    const content = await page.getTextContent();
    // content.items can also hold marked-content entries; keep only the items that carry text.
    const items: TextItem[] = content.items.filter((item): item is TextItem => 'str' in item);
    const parsedPage = parsePageItems(items);
    console.info(parsedPage.lines);
}

main();