@datagica/read-document
v0.1.2
Extract plain text from any kind of document
Extract plain text from any kind of document. Based on textract.
Current issues
read-document is not thread safe (because it uses textract, which apparently is not either), so conversions cannot run concurrently: you have to wait for each promise to complete before converting another document, for instance by chaining promises like this:
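    const read = require('@datagica/read-document');

    // `files` is your list of input files; `anotherAsyncPromise` is your own async step.
    // reduce() chains the reads so each one starts only after the previous one has finished.
    const sequentialPromise = files.reduce((p, file) =>
      p.then(done =>
        read({ file: file }).then(doc => anotherAsyncPromise(doc))
      ),
      Promise.resolve(0)
    );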
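The same one-at-a-time behaviour can also be written with async/await, which some readers find easier to follow than a reduce chain. This is only a sketch: `processSequentially` is a hypothetical helper, not part of read-document, and it takes the reading function as a parameter so it makes no assumption about the library beyond the `read({ file })` call shown above.

```javascript
// Hypothetical helper (not part of read-document): runs `worker` on each
// item strictly one at a time and collects the results in order.
async function processSequentially(items, worker) {
  const results = [];
  for (const item of items) {
    // Awaiting inside the loop means the previous call has settled
    // before the next one starts, so no two conversions overlap.
    results.push(await worker(item));
  }
  return results;
}

// Possible usage, assuming the package is installed and `files` is your input list:
// const read = require('@datagica/read-document');
// processSequentially(files, file => read({ file: file }))
//   .then(docs => console.log(docs.length + ' documents converted'));
```

If any conversion rejects, the loop stops at that file and the returned promise rejects, which matches how the chained version behaves.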
Prerequisites
- PDF extraction requires pdftotext to be installed.
- DOC and RTF extraction requires catdoc to be installed, unless on OSX, in which case textutil (installed by default) is used.
- PNG, JPG and GIF require tesseract to be available. Images need to be pretty clear, high-DPI and made almost entirely of just text for tesseract to be able to accurately extract the text.
- DXF extraction requires drawingtotext to be available.
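
Before converting a batch of documents, it can help to check which of the tools listed above are actually installed. The following is a rough sketch rather than anything read-document provides: `isOnPath`, `missingTools` and the `tools` map are hypothetical names, and the check assumes a Unix-like system where each tool is a plain executable file in one of the PATH directories.

```javascript
const fs = require('fs');
const path = require('path');

// Returns true if an executable file named `command` exists in one of the
// directories listed in `envPath` (defaults to the current PATH).
function isOnPath(command, envPath = process.env.PATH || '') {
  return envPath.split(path.delimiter).filter(Boolean).some(dir => {
    const candidate = path.join(dir, command);
    try {
      fs.accessSync(candidate, fs.constants.X_OK);
      return fs.statSync(candidate).isFile();
    } catch (err) {
      return false; // missing, unreadable, or not executable
    }
  });
}

// Tool names come from the prerequisites list; the format keys are illustrative.
const tools = {
  pdf: 'pdftotext',
  doc: process.platform === 'darwin' ? 'textutil' : 'catdoc',
  image: 'tesseract',
  dxf: 'drawingtotext'
};

// Lists the "format: command" pairs whose command could not be found.
function missingTools(required = tools) {
  return Object.entries(required)
    .filter(([, command]) => !isOnPath(command))
    .map(([format, command]) => format + ': ' + command);
}

// Possible usage:
// const missing = missingTools();
// if (missing.length > 0) console.warn('Missing tools: ' + missing.join(', '));
```

An empty result only shows that the binaries were found; it does not confirm that they run correctly or that a given document will convert.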