fresh-tabula-js
v2.0.0
Extract CSV data from PDF tables using tabula-java.
Convert tables inside PDFs to CSV via tabula-java using JavaScript.
This is a maintained fork of the tabula-js package, with changes such as:
- Non-stream asynchronous extraction (use async/await)
Please submit any issues (or e-mail me).
Getting Started
Only Node.js environments are supported due to file-system usage requirements. The package is exported as a CommonJS module.
Requirements
- Java Development Kit (JDK) with java available via command-line
- Node.js/npm
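Because the package shells out to Java, it can help to confirm that java is reachable before attempting an extraction. The following is a minimal sketch using Node's built-in child_process module; hasJava is a hypothetical helper name and is not part of fresh-tabula-js.

```javascript
// Sketch: check that a `java` executable is reachable from the command line.
const { spawnSync } = require('child_process');

// Returns true when `<command> -version` can be started and exits with status 0.
// `hasJava` is a hypothetical helper, not an API of fresh-tabula-js.
function hasJava(command = 'java') {
  const result = spawnSync(command, ['-version'], { encoding: 'utf8' });
  // `result.error` is set when the executable cannot be spawned (e.g. ENOENT).
  return !result.error && result.status === 0;
}

if (!hasJava()) {
  console.error('java was not found on the PATH; install a JDK first.');
}
```

Running a check like this at start-up gives a clearer message than a failed extraction later on.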
Installing
To install as a dependency via npm:
$ npm install --save fresh-tabula-js
Usage
A minimal end-to-end example:
// 1. Import the module
const Tabula = require('fresh-tabula-js');

const extractData = async () => {
  // 2. Create a table by passing a path to a PDF (this can be relative or absolute)
  const table = new Tabula('data/foobar.pdf');
  // 3. Call an extraction method
  return await table.getData();
};

// 4. Call the method! It returns a Promise, so wait for the result before using it
extractData().then((data) => console.log(data));
API
First, create a Tabula instance by calling the constructor with a path (relative or absolute) to a valid PDF.
Example:
const Tabula = require('fresh-tabula-js');
const table = new Tabula('path/to/pdf/foobar.pdf');
// Do stuff
Options
All extraction methods support the same set of options.
Options are passed through to tabula-java with some exceptions, such as the inability to write the output to file (-o). Extracted data is available through callbacks, streams, and return values.
Options are structured as a plain object.
| Key | Type | Default | Description |
| - | - | - | - |
| area | String or Array | Entire page | Coordinates of the portion(s) of the page to analyze, given as strings in the format top,left,bottom,right. For example, 269.875,12.75,790.5,561 or ["269.875,12.75,790.5,561", "132.45,23.2,256.3,534"]. |
| columns | String | none | X coordinates of column boundaries. Example: "10.1,20.2,30.3" |
| debug | Boolean | false | Print detected table areas instead of processing them. |
| guess | Boolean | true | Guess the portion(s) of the page to analyze and process. |
| silent | Boolean | false | Suppresses all stderr output from the tabula-java JAR only. JavaScript errors will still be logged. |
| noSpreadsheet | Boolean | false | Force PDF not to be extracted using spreadsheet-style extraction (if there are ruling lines separating each cell, as in a PDF of an Excel spreadsheet). |
| pages | String | 1 | Comma-separated list of ranges, or all. E.g. 1-3,5-7, 3, all. |
| spreadsheet | Boolean | false | Force PDF to be extracted using spreadsheet-style extraction (if there are ruling lines separating each cell, as in a PDF of an Excel spreadsheet). |
| password | String | empty | Password used to decrypt/access the document. |
| useLineReturns | Boolean | false | Use embedded line returns in cells (only in spreadsheet mode). |
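As an illustration of the table above, here is a sketch of an options object built from numeric coordinates. formatArea is a hypothetical helper and not part of the package; the coordinate values are the ones used as examples in the table.

```javascript
// Hypothetical helper: turn numeric coordinates into the
// "top,left,bottom,right" string that the `area` option expects.
function formatArea({ top, left, bottom, right }) {
  return [top, left, bottom, right].join(',');
}

// A plain options object using keys from the table above.
const options = {
  area: [
    formatArea({ top: 269.875, left: 12.75, bottom: 790.5, right: 561 }),
    formatArea({ top: 132.45, left: 23.2, bottom: 256.3, right: 534 }),
  ],
  pages: '1-3,5-7',
  silent: true,
};

// The options object is passed as the second constructor argument, e.g.:
// const table = new Tabula('dir/foobar.pdf', options);
console.log(options.area);
```

Keeping the coordinates as numbers until the last moment makes them easier to adjust or compute than hand-edited strings.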
Methods
Tabula.getData
Use this method to process extracted data from PDF asynchronously using async/await.
It returns an object in the following format:
{
  output: <String>,
  error: <String>,
}
Example:
const Tabula = require('fresh-tabula-js');
const data = async () => {
  const table = new Tabula('dir/foobar.pdf');
  return await table.getData();
};
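Once the result object is available, the output string usually needs to be split into rows. The sketch below shows one way to do that, under two assumptions that are not confirmed by the package itself: that output is newline-separated CSV text, and that a non-empty error combined with an empty output indicates a failure (error may otherwise hold warnings). parseCsvLine and toRows are hypothetical helpers, and the sample data is made up.

```javascript
// Hypothetical helper: split one CSV line into fields, honouring double quotes.
function parseCsvLine(line) {
  const fields = [];
  let field = '';
  let inQuotes = false;
  for (let i = 0; i < line.length; i += 1) {
    const ch = line[i];
    if (inQuotes) {
      if (ch === '"' && line[i + 1] === '"') {
        field += '"'; // doubled quote inside a quoted field
        i += 1;
      } else if (ch === '"') {
        inQuotes = false;
      } else {
        field += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ',') {
      fields.push(field);
      field = '';
    } else {
      field += ch;
    }
  }
  fields.push(field);
  return fields;
}

// Hypothetical helper: turn a getData()-style result into an array of rows.
// Assumption: no output plus a non-empty error means the extraction failed.
function toRows(result) {
  if (!result.output && result.error) {
    throw new Error(`Extraction failed: ${result.error}`);
  }
  return result.output
    .split(/\r?\n/)
    .filter((line) => line.length > 0)
    .map(parseCsvLine);
}

// Stand-in for a real result; the values are invented for illustration.
const sample = { output: 'name,total\n"Smith, J.",42\n', error: '' };
console.log(toRows(sample));
```

This line-based split does not handle quoted cells that contain line breaks; a dedicated CSV parser is a better fit for such documents.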
Tabula.streamSections
Use this method to process extracted data in sections (separate tables).
Callbacks will be executed for each parsed section of the PDF.
Extracted data is a string representing an array of all rows (in CSV format) found, including headers.
const Tabula = require('fresh-tabula-js');
const table = new Tabula('dir/foobar.pdf');
table.streamSections((err, data) => console.log(data));
We can use the area option to analyze specific portions of the document.
const Tabula = require('fresh-tabula-js');
const table = new Tabula('dir/foobar.pdf', {
  area: "269.875,150,690,545",
});
table.streamSections((err, data) => console.log(data));
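The examples above ignore the err argument. The following sketch shows a callback that keeps errors and sections apart, assuming the callback follows the Node-style (err, data) signature used in the examples. createSectionCollector is a hypothetical helper, and the calls at the end simulate the library so the sketch runs without a PDF.

```javascript
// Hypothetical helper: build a Node-style callback that gathers sections
// and records errors instead of silently dropping them.
function createSectionCollector() {
  const sections = [];
  const errors = [];
  const callback = (err, data) => {
    if (err) {
      errors.push(err);
      return;
    }
    sections.push(data);
  };
  return { callback, sections, errors };
}

const collector = createSectionCollector();
// With the real package this would be:
// table.streamSections(collector.callback);

// Simulated invocations with made-up data:
collector.callback(null, 'a,b\n1,2');
collector.callback(new Error('example failure'));
console.log(collector.sections.length, collector.errors.length);
```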
Tabula.stream
This is used to process data from PDFs via streams.
Example:
const Tabula = require('fresh-tabula-js');
new Tabula('dir/foobar.pdf')
  .stream()
  .pipe(process.stdout);
The underlying library is built on streams using Highland.js.
This means the returned stream supports Highland.js-style transformations and operations.
Example:
const Tabula = require('fresh-tabula-js');
const stream = new Tabula('dir/foobar.pdf')
  .stream();
stream.split()
  .doto(console.log)
  .done(() => console.log('All done!'));
Developing
Introduction
Branch Usage
Development is done in the develop branch.
When master changes (e.g. via pull request), Travis CI builds and deploys a new version of the package, using semantic versioning based on commit messages to determine the version type.
Commit Message Convention
Commit messages must be formatted according to the conventional commits Angular spec:
<type>[optional scope]: <description>
[optional body]
[optional footer]
The following types are supported:
- build: Changes that affect the build system or npm dependencies
- ci: Changes to CI config (e.g. Travis CI config changes)
- docs: Documentation-only changes
- feat: New features
- fix: Bug fix
- perf: Code change related to performance
- refactor: A code change that neither fixes a bug nor adds a feature
- style: Changes that do not affect the meaning of the code (white-space, formatting, missing semi-colons, etc.)
- test: Adding missing tests or correcting existing tests
Rules configuration is found in release.config.js.
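To make the header format concrete, here is a small sketch that checks the first line of a commit message against the pattern and types listed above. It is an illustration only and does not reproduce the actual rules in release.config.js; TYPES, HEADER_PATTERN and isValidHeader are hypothetical names.

```javascript
// Commit types taken from the list above.
const TYPES = ['build', 'ci', 'docs', 'feat', 'fix', 'perf', 'refactor', 'style', 'test'];

// Matches "<type>[optional scope]: <description>" on the header line.
// The scope, when present, is wrapped in parentheses: "feat(stream): ...".
const HEADER_PATTERN = new RegExp(`^(${TYPES.join('|')})(\\([^()\\s]+\\))?: \\S.*$`);

// Hypothetical helper: validate only the first line of a commit message.
function isValidHeader(message) {
  const header = message.split('\n')[0];
  return HEADER_PATTERN.test(header);
}

console.log(isValidHeader('feat(stream): add area option'));
console.log(isValidHeader('update readme'));
```

A check like this could run in a commit hook, although the project's CI configuration remains the authority on what is accepted.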
Installing
Clone the repository.
Switch to the develop branch:
$ git checkout develop
Install dependencies:
$ npm install
Testing
To run tests:
$ npm run test
To run tests in watch mode:
$ npm run test:watch
To run test coverage:
$ npm run test:cov
Building
To run deployment builds:
$ npm run build
Deploying
- Push the changes to develop.
- Merge to master via pull request.
Travis CI will build and deploy the new version of the package (based on semantic commits) to NPM.
Acknowledgements
- Ezo Saleh, original author of this package
- The tabula-java team
- tabula