npm package discovery and stats viewer.

Discover Tips

  • General search

    [free text search, go nuts!]

  • Package details

    pkg:[package-name]

  • User packages

    @[username]

Sponsor

Optimize Toolset

I’ve always been into building performant and accessible sites, but lately I’ve been taking it extremely seriously. So much so that I’ve been building a tool to help me optimize and monitor the sites that I build to make sure that I’m making an attempt to offer the best experience to those who visit them. If you’re into performant, accessible and SEO friendly sites, you might like it too! You can check it out at Optimize Toolset.

About

Hi, 👋, I’m Ryan Hefner  and I built this site for me, and you! The goal of this site was to provide an easy way for me to check the stats on my npm packages, both for prioritizing issues and updates, and to give me a little kick in the pants to keep up on stuff.

As I was building it, I realized that I was actually using the tool to build the tool, and figured I might as well put this out there and hopefully others will find it to be a fast and useful way to search and browse npm packages as I have.

If you’re interested in other things I’m working on, follow me on Twitter or check out the open source projects I’ve been publishing on GitHub.

I am also working on a Twitter bot for this site to tweet the most popular, newest, random packages from npm. Please follow that account now and it will start sending out packages soon–ish.

Open Software & Tools

This site wouldn’t be possible without the immense generosity and tireless efforts from the people who make contributions to the world and share their work via open source initiatives. Thank you 🙏

© 2024 – Pkg Stats / Ryan Hefner

fresh-tabula-js

v2.0.0

Published

Extract CSV data from PDF tables using tabula-java.

Downloads

432

Readme

fresh-tabula-js

Convert tables inside PDFs to CSV via tabula-java using JavaScript.

Build Status Coverage Status semantic-release

This is a maintained fork of the tabula-js package, with changes such as:

  • Non-stream asynchronous extraction (use async/await)

Please submit any issues (or e-mail me).

Contents

Getting Started

Only Node.js environments are supported due to file-system usage requirements. The package is exported as a CommonJS module.

Requirements

  • Java Development Kit (JDK) with java available via command-line
  • Node.js/npm

Installing

To install as a dependency via npm:

$ npm install --save fresh-tabula-js

Usage

Import the module:

// 1. Import the module
const Tabula = require('fresh-tabula-js');
const extractData = async () => {
  // 2. Instantiate a table via passing a path to a PDF (this can be relative or absolute)
  const table = new Tabula('data/foobar.pdf');
  // 3. Call an extraction method
  return await table.getData();
};
// 4. Call the method!
const data = extractData();

API

First, an instance of Tabula must be instantiated via calling tabula with a path (relative or absolute) to a valid PDF.

Example:

const Tabula = require('fresh-tabula-js');
const table = new Tabula('path/to/pdf/foobar.pdf');
// Do stuff

Options

All extraction methods support the same set of options.

Options are passed through to tabula-java with some exceptions, such as the inability to write the output to file (-o). Extracted data is available through callbacks, streams, and return values.

Options are structured as a plain object.

| Key | Type | Default | Description | | - | - | - | - | | area | String or Array | Entire page | Co-ordinates of the portion(s) of the page to analyze, formatted in strings in the following format top,left,bottom,right. For example, 269.875,12.75,790.5,561 or ["269.875,12.75,790.5,561", "132.45,23.2,256.3,534"]. | columns | String | none | X coordinates of column boundaries. Example "10.1,20.2,30.3" | | debug | Boolean | false | Print detected table areas instead of processing them. | | guess | Boolean | true | Guess the portion(s) of the page to analyze and process. | | silent | Boolean | false | Suppresses all stderr output from the tabula-java JAR only. JavaScript errors will still be logged. | | noSpreadsheet | Boolean | false | Force PDF not to be extracted using spreadsheet-style extraction (if there are ruling lines separating each cell, as in a PDF of an Excel spreadsheet). | | pages | String | 1 | Comma separated list of ranges, or all. E.g. 1-3,5-7, 3, all. | spreadsheet | Boolean | false | Force PDF to be extracted using spreadsheet-style extraction (if there are ruling lines separating each cell, as in a PDF of an Excel spreadsheet). | | password | String | empty | Password used to decrypt/access the document. | | useLineReturns | Boolean | false | Use embedded line returns in cells (only in spreadsheet mode). |

Methods

Tabula.getData

Use this method to process extracted data from PDF asynchronously using async/await.

It returns an object in the following format:

{
  output: <String>,
  error: <String>,
}

Example:

const Tabula = require('fresh-tabula-js');
const data = async () => {
  const table = new Tabula('dir/foobar.pdf');
  return await table.getData();
};

Tabula.streamSections

Use this method to process extracted data in sections (separate tables).

Callbacks will be executed for each parsed section of the PDF.

Extracted data is a string representing an array of all rows (in CSV format) found, including headers.

const Tabula = require('fresh-tabula-js');
const table = new Tabula('dir/foobar.pdf');
table.streamSections((err, data) => console.log(data));

We can use the area option to analyze specific portions of the document.

const Tabula = require('fresh-tabula-js');
const table = new Tabula('dir/foobar.pdf', {
  area: "269.875,150,690,545",
});
table.streamSections((err, data) => console.log(data));

Tabula.stream

This is used to process data from PDFs via streams.

Example:

const Tabula = require('fresh-tabula-js');
new Tabula('dir/foobar.pdf')
  .stream()
  .pipe(process.stdout);

The underlying library is built on streams using Highland.js.

This means the returned stream can perform highland-js-style transformations and operations.

Example:

const Tabula = require('fresh-tabula-js');
const stream = new Tabula('dir/foobar.pdf')
  .stream();
stream.split()
  .doto(console.log)
  .done(() => console.log('All done!'));

Developing

Introduction

Branch Usage

Development is done in the develop branch.

When master changes (e.g. via pull request), Travis CI will build and deploy a new version of the package using semantic versioning based on commit messages to determine the version type.

Commit Message Convention

Commit messages must be formatted according to the conventional commits Angular spec:

<type>[optional scope]: <description>

[optional body]

[optional footer]

The following types are supported:

  • build: Changes that affect the build system or npm dependencies
  • ci: Changes to CI config (e.g. Travis CI config changes)
  • docs: Documentation-only changes
  • feat: New features
  • fix: Bug fix
  • perf: Code change related to performance
  • refactor: A code change that neither fixes a bug nor adds a feature
  • style: Changes that do not affect the meaning of the code (white-space, formatting, missing semi-colons, etc.)
  • test: Adding missing tests or correcting existing tests

Rules configuration is found in in release.config.js.

Installing

  1. Clone the repository.

  2. Switch to the develop branch:

    git checkout develop
  3. Install dependencies:

    $ npm install

Testing

To run tests:

$ npm run test

To run tests in watch mode:

$ npm run test:watch

To run test coverage:

$ npm run test:cov

Building

To run deployment builds:

$ npm run build

Deploying

  1. Push the changes to develop.
  2. Merge to master via pull request.

Travis CI will build and deploy the new version of the package (based on semantic commits) to NPM.

Acknowledgements