@epilogo/unstructured-io-node
v2.0.6
Node bindings for unstructured.io
- Current Hash: 6ba376ab7eaa73e12be35438bae47cfa0ca7dfe5
- Current Version: 0.15.14
https://github.com/Unstructured-IO/unstructured/commit/6ba376ab7eaa73e12be35438bae47cfa0ca7dfe5
To release a new version
- Replace all 6ba376ab7eaa73e12be35438bae47cfa0ca7dfe5 with the new hash
- Replace all 0.15.14 with the new version tag
- Run ./scripts/install.sh
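The two replacement steps above can be sketched as a small text transform. This is only an illustration: `PinnedRef` and `replacePinnedRef` are hypothetical names, not part of this package, and the sketch works on a string rather than on the repository's files.

```typescript
// Hypothetical helper, not part of this package: swaps the pinned commit
// hash and version tag wherever they occur in a piece of text.
interface PinnedRef {
  hash: string;    // full commit hash
  version: string; // version tag, e.g. "0.15.14"
}

function replacePinnedRef(text: string, current: PinnedRef, next: PinnedRef): string {
  // split/join replaces every literal occurrence, so the dots in the
  // version tag are not treated as regular-expression wildcards.
  return text
    .split(current.hash).join(next.hash)
    .split(current.version).join(next.version);
}
```

Applying it to each file that mentions the pinned commit, then running ./scripts/install.sh, would correspond to the release steps listed above.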
This library provides Node.js bindings to the unstructured.io Python module. It enables Node.js applications to utilize the document parsing capabilities of the unstructured library.
Motivation
The unstructured Python library excels at extracting structured data from various document formats, including PDFs, HTML, Word documents, and more. However, some applications need this functionality directly within a Node.js environment. This library bridges that gap, allowing Node.js applications to leverage the power of unstructured without relying solely on a Python environment.
Installation
You can install the library using npm, yarn, or pnpm:
npm install @epilogo/unstructured-io-node
yarn add @epilogo/unstructured-io-node
pnpm add @epilogo/unstructured-io-node
Post-installation
The post-installation script (install.sh) will execute the following:
- Clone the unstructured-io repository: It clones the repository at a specific commit (6ba376ab7eaa73e12be35438bae47cfa0ca7dfe5) into a python/unstructured-io directory within the library's folder.
- Install system dependencies: Based on your operating system (currently supporting Linux and macOS), it installs the necessary system packages for unstructured to function. These packages include tools for image processing, OCR, and handling various document formats.
- Create and activate a Python virtual environment: It creates a virtual environment within the python directory to isolate the Python dependencies of this library.
- Install Python dependencies: It installs the unstructured Python package along with its dependencies within the activated virtual environment.
- Install Node.js and pnpm:
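Because the script supports only Linux and macOS, an application may want to fail early on other systems before triggering any setup. The following is a minimal sketch of such a guard; `isSupportedPlatform` and `assertSupportedPlatform` are hypothetical helpers, not functions exported by this package.

```typescript
// Platform identifiers as reported by Node's process.platform:
// Linux is "linux" and macOS is "darwin".
const SUPPORTED_PLATFORMS: readonly string[] = ["linux", "darwin"];

// Hypothetical helper: true when the given platform identifier is supported.
function isSupportedPlatform(platform: string): boolean {
  return SUPPORTED_PLATFORMS.indexOf(platform) !== -1;
}

// Hypothetical helper: throws a descriptive error on unsupported platforms.
function assertSupportedPlatform(platform: string): void {
  if (!isSupportedPlatform(platform)) {
    throw new Error(
      `Unsupported platform "${platform}": only Linux and macOS are supported.`
    );
  }
}

// Usage in a Node.js program: assertSupportedPlatform(process.platform);
```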
Usage
Here's a basic example demonstrating how to use the library:
import { UnstructuredIO } from '@epilogo/unstructured-io-node';
import * as path from 'path';

// This must be called at least once in your container or local environment.
// It takes care of installing the necessary dependencies.
// Only macOS and Linux are supported.
await UnstructuredIO.ensureEnvironmentSetup();

const partitioned = await UnstructuredIO.partition({
  filename: path.join(__dirname, '../__tests__/data/your-document.pdf'),
  strategy: 'hi_res',
  languages: ['eng'],
});

console.log(partitioned);
Explanation:
- Import: Import the UnstructuredIO object from the library.
- Call partition: Invoke the partition function on the UnstructuredIO object, passing an options object as an argument.
  - filename: Specify the path to the document you want to process.
  - strategy, languages, etc.: Adjust other options as needed based on the unstructured Python library's API.
- Process the result: The partition function returns the extracted structured data, which you can then process according to your application's needs.
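As one illustration of processing the result, the sketch below groups element text by element type. It assumes the result is an array of objects with `type` and `text` fields, modelled on the dictionaries the unstructured Python library serializes; the README does not document the exact shape this binding returns, so `PartitionedElement` and `groupTextByType` are hypothetical and may need adjusting.

```typescript
// Assumed element shape (not confirmed for this binding), modelled on the
// Python library's serialized elements: type, element_id, text, metadata.
interface PartitionedElement {
  type: string;
  text: string;
  element_id?: string;
  metadata?: { [key: string]: unknown };
}

// Hypothetical helper: collects the text of each element under its type,
// preserving document order within each group.
function groupTextByType(elements: PartitionedElement[]): { [type: string]: string[] } {
  const groups: { [type: string]: string[] } = {};
  for (const element of elements) {
    if (!groups[element.type]) {
      groups[element.type] = [];
    }
    groups[element.type].push(element.text);
  }
  return groups;
}
```

With a result of that shape, `groupTextByType(partitioned)` would yield, for example, all `Title` texts in one array and all `NarrativeText` texts in another.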
Refer to the unstructured library's documentation (https://docs.unstructured.io/) for details on available options and their usage within the partition function.
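Since options are passed through to the Python library, a typo in an option value may only surface once the Python side runs. The sketch below checks an options object up front. It is an assumption-laden illustration: `PartitionOptionsSketch` and `validateOptions` are hypothetical, only `filename`, `strategy`, and `languages` appear in this README, and the strategy names are those documented for the Python library's PDF partitioning, which this binding may or may not accept in full.

```typescript
// Strategy names documented for the unstructured Python library;
// whether this binding accepts all of them is an assumption.
const KNOWN_STRATEGIES: readonly string[] = ["auto", "fast", "hi_res", "ocr_only"];

// Hypothetical options shape covering only the fields shown in the README.
interface PartitionOptionsSketch {
  filename: string;
  strategy?: string;
  languages?: string[];
}

// Hypothetical helper: returns a list of problems, empty when none are found.
function validateOptions(options: PartitionOptionsSketch): string[] {
  const problems: string[] = [];
  if (options.filename.trim() === "") {
    problems.push("filename must not be empty");
  }
  if (options.strategy !== undefined && KNOWN_STRATEGIES.indexOf(options.strategy) === -1) {
    problems.push(`unknown strategy "${options.strategy}"`);
  }
  return problems;
}
```

Calling `validateOptions` before `partition` and refusing to proceed when the returned list is non-empty keeps such mistakes on the Node.js side.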