fresh-tabula-js

v2.0.0

Published

3 years ago

Extract CSV data from PDF tables using tabula-java.


cdtinney

pdf parser tables csv pdf to csv

fresh-tabula-js

Convert tables inside PDFs to CSV via tabula-java using JavaScript.

This is a maintained fork of the tabula-js package, with changes such as:

Non-stream asynchronous extraction (use async/await)

Please submit any issues (or e-mail me).

Getting Started

Only Node.js environments are supported due to file-system usage requirements. The package is exported as a CommonJS module.

Requirements

Java Development Kit (JDK), with java available on the command line
Node.js/npm

Installing

To install as a dependency via npm:

$ npm install --save fresh-tabula-js

Usage

Import the module, create an instance with the path to a PDF, and call an extraction method:

// 1. Import the module
const Tabula = require('fresh-tabula-js');
const extractData = async () => {
  // 2. Create an instance by passing a path to a PDF (relative or absolute)
  const table = new Tabula('data/foobar.pdf');
  // 3. Call an extraction method
  return table.getData();
};
// 4. Call the function; it returns a promise, so wait for the result
extractData().then((data) => console.log(data.output));

API

First, create an instance of Tabula by calling the constructor with a path (relative or absolute) to a valid PDF.

Example:

const Tabula = require('fresh-tabula-js');
const table = new Tabula('path/to/pdf/foobar.pdf');
// Do stuff

Options

All extraction methods support the same set of options.

Options are passed through to tabula-java, with some exceptions. For example, writing the output to a file (-o) is not supported. Extracted data is instead available through callbacks, streams, and return values.

Options are structured as a plain object.
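
To make the idea of "passing options through" more concrete, here is a minimal sketch of how a plain options object could be translated into tabula-java style command-line flags. The helper `toCliArgs` is hypothetical and is not part of this package, and the `pages` and `guess` keys are assumptions based on tabula-java's `--pages` and `--guess` flags; only `area` appears in this README.

```javascript
// Hypothetical helper: NOT part of fresh-tabula-js. It only illustrates how a
// plain options object could map onto tabula-java style command-line flags.
function toCliArgs(options) {
  const args = [];
  for (const [key, value] of Object.entries(options)) {
    // camelCase key -> --kebab-case flag
    const flag = '--' + key.replace(/[A-Z]/g, (c) => '-' + c.toLowerCase());
    if (value === true) {
      // Boolean switches carry no value
      args.push(flag);
    } else if (value !== false && value != null) {
      // Everything else is passed as "--flag value"
      args.push(flag, String(value));
    }
  }
  return args;
}

// `pages` and `guess` are assumed option names; `area` is taken from this README.
const options = { area: '269.875,150,690,545', pages: '1-3', guess: true };
console.log(toCliArgs(options));
```

The package's real translation logic may differ; check its source before relying on any particular option name.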

Methods

`Tabula.getData`

Use this method to extract data from a PDF asynchronously using async/await.

It returns an object in the following format:

{
  output: <String>,
  error: <String>,
}

Example:

const Tabula = require('fresh-tabula-js');
const extractData = async () => {
  const table = new Tabula('dir/foobar.pdf');
  // Resolves to an object of the form { output, error }
  return table.getData();
};
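
Once the promise resolves, the `{ output, error }` object still has to be checked and parsed. The sketch below shows one way that might look. The helper `rowsFromResult` is hypothetical, and it assumes that `error` is empty on success and that `output` is plain CSV text without quoted commas.

```javascript
// Hypothetical helper: turns a getData()-style result into an array of rows.
// Assumes `error` is empty on success and `output` holds CSV text.
function rowsFromResult(result) {
  if (result.error) {
    throw new Error('Extraction failed: ' + result.error);
  }
  return (result.output || '')
    .split(/\r?\n/)                        // one CSV record per line
    .filter((line) => line.trim() !== '')  // drop blank lines
    .map((line) => line.split(','));       // naive split: no quoted-comma support
}

// Stand-in for a value resolved by table.getData()
const sample = { output: 'name,qty\nwidget,3\n', error: '' };
console.log(rowsFromResult(sample));
```

For real-world tables with quoted fields, a dedicated CSV parser would be a safer choice than the naive split used here.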

`Tabula.streamSections`

Use this method to process extracted data in sections (separate tables).

Callbacks will be executed for each parsed section of the PDF.

Extracted data is a string representing an array of all rows (in CSV format) found, including headers.

const Tabula = require('fresh-tabula-js');
const table = new Tabula('dir/foobar.pdf');
table.streamSections((err, data) => console.log(data));
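
The example above ignores `err`. A minimal sketch of a callback that keeps successful sections and failures apart might look like this. `createSectionCollector` is a hypothetical helper, and the error-first `(err, data)` signature is assumed from the example above.

```javascript
// Hypothetical helper: builds an error-first callback plus the arrays it fills.
function createSectionCollector() {
  const sections = [];
  const errors = [];
  const onSection = (err, data) => {
    if (err) {
      errors.push(err);   // keep failures separate instead of dropping them
      return;
    }
    sections.push(data);  // one entry per parsed section
  };
  return { sections, errors, onSection };
}

const { sections, errors, onSection } = createSectionCollector();
// With the real package this would be: table.streamSections(onSection);
// Here the callback is invoked by hand with stand-in values.
onSection(null, 'a,b\n1,2');
onSection(new Error('bad page'));
console.log(sections.length, errors.length);
```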

We can use the area option to analyze specific portions of the document.

const Tabula = require('fresh-tabula-js');
const table = new Tabula('dir/foobar.pdf', {
  area: "269.875,150,690,545",
});
table.streamSections((err, data) => console.log(data));
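
tabula-java documents the area string as `top,left,bottom,right`, with all values in points. A small helper can make that ordering explicit and catch swapped coordinates early. `buildArea` below is a hypothetical sketch, not something the package provides.

```javascript
// Hypothetical helper: formats page coordinates (in points) as an area string
// in the top,left,bottom,right order that tabula-java documents.
function buildArea({ top, left, bottom, right }) {
  const values = [top, left, bottom, right];
  if (!values.every(Number.isFinite)) {
    throw new TypeError('top, left, bottom and right must be finite numbers');
  }
  if (bottom <= top || right <= left) {
    throw new RangeError('bottom must exceed top, and right must exceed left');
  }
  return values.join(',');
}

console.log(buildArea({ top: 269.875, left: 150, bottom: 690, right: 545 }));
```

The resulting string can then be passed as the `area` option when constructing a Tabula instance.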

`Tabula.stream`

This is used to process data from PDFs via streams.

Example:

const Tabula = require('fresh-tabula-js');
new Tabula('dir/foobar.pdf')
  .stream()
  .pipe(process.stdout);

The underlying library is built on streams using Highland.js.

This means the returned stream supports Highland.js-style transformations and operations.

Example:

const Tabula = require('fresh-tabula-js');
const stream = new Tabula('dir/foobar.pdf')
  .stream();
stream.split()
  .doto(console.log)
  .done(() => console.log('All done!'));

Developing

Introduction

Branch Usage

Development is done in the develop branch.

When master changes (e.g. via a merged pull request), Travis CI builds and deploys a new version of the package. The type of version bump is determined from the commit messages, following semantic versioning.

Commit Message Convention

Commit messages must follow the Conventional Commits specification (Angular convention):

<type>[optional scope]: <description>

[optional body]

[optional footer]

The following types are supported:

build: Changes that affect the build system or npm dependencies
ci: Changes to CI config (e.g. Travis CI config changes)
docs: Documentation-only changes
feat: New features
fix: Bug fix
perf: Code change related to performance
refactor: A code change that neither fixes a bug nor adds a feature
style: Changes that do not affect the meaning of the code (white-space, formatting, missing semi-colons, etc.)
test: Adding missing tests or correcting existing tests

Rules configuration is found in release.config.js.
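
As a rough illustration of the header format and the types listed above, the following sketch checks the first line of a commit message. It is not the project's actual rule set (that lives in release.config.js), and `isValidHeader` is a hypothetical name.

```javascript
// Types taken from the list in this README.
const TYPES = ['build', 'ci', 'docs', 'feat', 'fix', 'perf', 'refactor', 'style', 'test'];

// <type>[optional scope]: <description>
const HEADER = new RegExp(`^(${TYPES.join('|')})(\\([^()\\s]+\\))?: \\S.*$`);

// Hypothetical helper: only the first line (the header) is checked.
function isValidHeader(message) {
  const firstLine = message.split(/\r?\n/)[0];
  return HEADER.test(firstLine);
}

console.log(isValidHeader('feat(api): add getData method'));
console.log(isValidHeader('feature: add getData method'));
```

The first call passes because `feat` is a supported type with a valid scope; the second fails because `feature` is not in the list.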

Installing

Clone the repository.
Switch to the develop branch:
```
$ git checkout develop
```
Install dependencies:
```
$ npm install
```

Testing

To run tests:

$ npm run test

To run tests in watch mode:

$ npm run test:watch

To run test coverage:

$ npm run test:cov

Building

To run deployment builds:

$ npm run build

Deploying

Push the changes to develop.
Merge to master via pull request.

Travis CI will build the new version of the package and deploy it to npm, with the version number derived from the semantic commit messages.

Acknowledgements

Ezo Saleh, original author of this package
The tabula-java team
tabula
