textract.apporoad

v10002.4.0

Published

4 years ago

Extracting text from files of various types including html, pdf, doc, docx, xls, xlsx, csv, pptx, png, jpg, gif, rtf, text/*, and various OpenOffice formats.


apporoad

textract extract html csv text pdf docx doc xls xlsx png jpg gif rtf dxf pptx html markdown xml odt ott xlsb xlsm xltx ods ots potx odg otg epub

textract

A text extraction node module.

Currently Extracts...

HTML, HTM
ATOM, RSS
Markdown
EPUB
XML, XSL
PDF
DOC, DOCX
ODT, OTT (experimental, feedback needed!)
RTF
XLS, XLSX, XLSB, XLSM, XLTX
CSV
ODS, OTS
PPTX, POTX
ODP, OTP
ODG, OTG
PNG, JPG, GIF
DXF
application/javascript
All text/* mime-types.

In almost all cases above, what textract cares about is the mime type. So .html and .htm, both possessing the same mime type, will be extracted. Other extensions that share mime types with those above should also extract successfully. For example, application/vnd.ms-excel is the mime type for .xls, but also for 5 other file types.
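The idea that the mime type, not the extension, selects the extractor can be illustrated with a small sketch. The lookup tables and the `extractorFor` helper below are hypothetical and are not textract's actual internals; they only assume the standard mime types for .html/.htm and for .xls/.xlt.

```javascript
// Hypothetical illustration only: this is not textract's real lookup code.
// Several extensions can share one mime type, and the mime type
// (not the extension) decides which extractor is used.
const mimeByExtension = {
  html: 'text/html',
  htm: 'text/html',
  xls: 'application/vnd.ms-excel',
  xlt: 'application/vnd.ms-excel',
};

// Extractor names here are made up for the example.
const extractorByMime = {
  'text/html': 'html',
  'application/vnd.ms-excel': 'xls',
};

function extractorFor(fileName) {
  const ext = fileName.split('.').pop().toLowerCase();
  const mime = mimeByExtension[ext];
  return mime ? extractorByMime[mime] : undefined;
}

console.log(extractorFor('page.htm'));   // html
console.log(extractorFor('report.xlt')); // xls
console.log(extractorFor('notes.xyz'));  // undefined
```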

Does textract not extract from files of the type you need? Add an issue or submit a pull request. In many cases textract is already capable; it is just not paying attention to the mime type you are interested in.

Install

npm install textract

Extraction Requirements

Note: if any of the requirements below are missing, textract will still run and extract every file type it is capable of handling. Not having these items installed does not prevent you from using textract; it only prevents you from extracting those specific file types.

PDF extraction requires pdftotext to be installed.
DOC extraction requires antiword to be installed, unless on OSX, in which case textutil (installed by default) is used.
RTF extraction requires unrtf to be installed, unless on OSX, in which case textutil (installed by default) is used.
PNG, JPG and GIF require tesseract to be available. Images need to be pretty clear, high DPI and made almost entirely of just text for tesseract to be able to accurately extract the text.
DXF extraction requires drawingtotext to be available.

Configuration

Configuration can be passed into textract. The following configuration options are available

preserveLineBreaks: When using the command line this is set to true to preserve stdout readability. When using the library via node this is set to false. Pass this in as true and textract will not strip any line breaks.
preserveOnlyMultipleLineBreaks: Some extractors, like PDF, insert line breaks at the end of every line, even in the middle of a sentence. If this option (default false) is set to true, any single line break is removed but multiple consecutive line breaks are preserved. Check your output with this option, though: it does not preserve paragraphs unless they are separated by multiple breaks.
exec: Some extractors (dxf) use node's exec functionality. This setting allows for providing config to exec execution. One reason you might want to provide this config is if you are dealing with very large files. You might want to increase the exec maxBuffer setting.
[ext].exec: Each extractor can take specific exec config. Keep in mind many extractors are responsible for extracting multiple types, so, for instance, the odt extractor is what you would configure for odt and odg/odt etc. Check the extractors to see which you want to specifically configure. At the bottom of each is a list of types for which the extractor is responsible.
tesseract.lang: A pass-through to tesseract allowing for setting of language for extraction. ex: { tesseract: { lang:"chi_sim" } }
tesseract.cmd: tesseract.lang allows a quick means to provide the most popular tesseract option, but if you need to configure more options, you can simply pass cmd. cmd is the string that matches the command-line options you want to pass to tesseract. For instance, to provide language and psm, you would pass { tesseract: { cmd:"-l chi_sim -psm 10" } }
pdftotextOptions: This is a proxy options object to the library textract uses for pdf extraction: pdf-text-extract. Options include ownerPassword, userPassword if you are extracting text from password protected PDFs. IMPORTANT: textract modifies the pdf-text-extract layout default so that, instead of layout: layout, it uses layout:raw. It is not suggested you modify this without understanding what trouble that might get you in. See this GH issue for why textract overrides that library's default.
typeOverride: Used with fromUrl, if set, rather than using the content-type from the URL request, will use the provided typeOverride.
includeAltText: When extracting HTML, whether or not to include alt text with the extracted text. By default this is false.
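The behavior described for preserveOnlyMultipleLineBreaks can be illustrated with a short sketch. The `collapseSingleLineBreaks` helper is hypothetical and is not textract's implementation; replacing each single break with a space (rather than nothing) is an assumption made for readability.

```javascript
// Hypothetical helper, written only to illustrate the option's described effect.
function collapseSingleLineBreaks(text) {
  // Match a newline that has a non-newline character before it and
  // no newline after it, i.e. a lone line break inside a paragraph.
  // Replacing it with a space is an assumption of this sketch.
  return text.replace(/([^\n])\n(?!\n)/g, '$1 ');
}

const pdfLikeText =
  'This sentence was\nwrapped by the extractor.\n\nNext paragraph.';

console.log(JSON.stringify(collapseSingleLineBreaks(pdfLikeText)));
// "This sentence was wrapped by the extractor.\n\nNext paragraph."
```

With the library itself, the equivalent would be passing `{ preserveOnlyMultipleLineBreaks: true }` as the config argument.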

To use this configuration at the command line, prefix each option with --.

Ex: textract image.png --tesseract.lang=deu
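As a sketch of how a nested config object corresponds to dotted command-line options, the hypothetical `toCliFlags` helper below flattens a config into `--key=value` strings. It is not part of textract, and it assumes the `--name=value` form shown in the example above is accepted for every option.

```javascript
// Hypothetical helper: turns a nested config object into dotted CLI flags.
// Assumes the --name=value form works for all options, which is only
// confirmed above for --tesseract.lang.
function toCliFlags(config, prefix = '') {
  const flags = [];
  for (const [key, value] of Object.entries(config)) {
    const name = prefix ? `${prefix}.${key}` : key;
    if (value !== null && typeof value === 'object') {
      // Recurse into nested sections such as { tesseract: { lang } }.
      flags.push(...toCliFlags(value, name));
    } else {
      flags.push(`--${name}=${value}`);
    }
  }
  return flags;
}

const flags = toCliFlags({ preserveLineBreaks: false, tesseract: { lang: 'deu' } });
console.log(['textract', 'image.png', ...flags].join(' '));
// textract image.png --preserveLineBreaks=false --tesseract.lang=deu
```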

Usage

Command Line

If textract is installed globally, via npm install -g textract, then the following command will write the extracted text to the console for a file on the file system.

$ textract pathToFile

Flags

Configuration flags can be passed into textract via the command line.

textract pathToFile --preserveLineBreaks false

Parameters like exec.maxBuffer can be passed as you'd expect.

textract pathToFile --exec.maxBuffer 500000

And multiple flags can be used together.

textract pathToFile --preserveLineBreaks false --exec.maxBuffer 500000

Node

Import

var textract = require('textract');

APIs

There are several ways to extract text. For all methods, the extracted text and an error object are passed to a callback.

error will contain informative text about why the extraction failed. If textract does not currently extract files of the type provided, a typeNotFound flag will be tossed on the error object.
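A minimal sketch of a callback that distinguishes an unsupported type from other failures follows. The `describeResult` helper and its message wording are hypothetical; the sketch assumes the error is a standard Error object with a `message` property, plus the `typeNotFound` flag described above.

```javascript
// Hypothetical helper: summarizes the (error, text) pair passed to a
// textract callback. Assumes `error` is an Error with a `message`.
function describeResult(error, text) {
  if (error) {
    return error.typeNotFound
      ? 'unsupported type: ' + error.message
      : 'extraction failed: ' + error.message;
  }
  return 'extracted ' + text.length + ' characters';
}

// Usage, assuming textract is installed and filePath points at a real file:
// var textract = require('textract');
// textract.fromFileWithPath(filePath, function (error, text) {
//   console.log(describeResult(error, text));
// });
```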

File

textract.fromFileWithPath(filePath, function( error, text ) {})

textract.fromFileWithPath(filePath, config, function( error, text ) {})
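For callers who prefer promises, the callback form can be wrapped as in the sketch below. The `extractFile` helper is hypothetical and not part of textract; the module is passed in as a parameter so the wrapper can be exercised without textract installed.

```javascript
// Hypothetical wrapper: adapts the callback API to a Promise.
// `textract` is injected; in real use it would be require('textract').
function extractFile(textract, filePath, config = {}) {
  return new Promise((resolve, reject) => {
    textract.fromFileWithPath(filePath, config, (error, text) => {
      if (error) {
        reject(error);
      } else {
        resolve(text);
      }
    });
  });
}

// Usage, assuming textract is installed and ./report.pdf exists
// (the file name is made up for this example):
// const textract = require('textract');
// extractFile(textract, './report.pdf', {
//   preserveLineBreaks: true,
//   exec: { maxBuffer: 500000 },
// }).then(console.log, console.error);
```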

File + mime type

textract.fromFileWithMimeAndPath(type, filePath, function( error, text ) {})

textract.fromFileWithMimeAndPath(type, filePath, config, function( error, text ) {})

Buffer + mime type

textract.fromBufferWithMime(type, buffer, function( error, text ) {})

textract.fromBufferWithMime(type, buffer, config, function( error, text ) {})
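As a sketch, content that is already in memory can be handed to textract without writing it to disk yourself. The `extractPlainText` helper is hypothetical; `text/plain` is chosen because all text/* mime types are listed as supported.

```javascript
// Hypothetical helper: extracts text from an in-memory string by
// declaring its mime type explicitly.
// `textract` is injected; in real use it would be require('textract').
function extractPlainText(textract, content, callback) {
  const buffer = Buffer.from(content, 'utf8');
  textract.fromBufferWithMime('text/plain', buffer, callback);
}

// Usage, assuming textract is installed:
// extractPlainText(require('textract'), 'Hello,   world', function (error, text) {
//   console.log(error || text);
// });
```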

Buffer + file name/path

textract.fromBufferWithName(name, buffer, function( error, text ) {})

textract.fromBufferWithName(name, buffer, config, function( error, text ) {})
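When only a file name is known, for example for an uploaded file, the name-based variant lets textract work out the type from the name. In the sketch below, the `extractUpload` helper and the `{ originalName, data }` upload shape are hypothetical.

```javascript
// Hypothetical helper: `upload` is assumed to look like
// { originalName: 'report.docx', data: <Buffer> }.
// `textract` is injected; in real use it would be require('textract').
function extractUpload(textract, upload, callback) {
  textract.fromBufferWithName(upload.originalName, upload.data, callback);
}
```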

URL

When passing a URL, the URL can be either a string or a node.js URL object. Using the URL object allows fine-grained control over the URL being used.

textract.fromUrl(url, function( error, text ) {})

textract.fromUrl(url, config, function( error, text ) {})
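The typeOverride option described under Configuration pairs naturally with fromUrl when a server reports an unhelpful content-type. The sketch below shows one way to pass it; the `extractFromUrl` helper and the example URL are hypothetical.

```javascript
// Hypothetical helper: fetches and extracts from a URL, optionally
// overriding the content-type reported by the server.
// `textract` is injected; in real use it would be require('textract').
function extractFromUrl(textract, url, mimeType, callback) {
  const config = mimeType ? { typeOverride: mimeType } : {};
  textract.fromUrl(url, config, callback);
}

// Usage, assuming textract is installed (the URL is made up):
// extractFromUrl(require('textract'), 'https://example.com/download?id=42',
//   'application/pdf', function (error, text) { console.log(error || text); });
```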

Testing Notes

Running Tests on a Mac?

sudo port install tesseract-chi-sim
sudo port install tesseract-eng
You will also want to disable textract's usage of textutil as the tests are based on output from antiword.
- Go into /lib/extractors/{doc|doc-osx|rtf} and modify the code under if ( os.platform() === 'darwin' ) {. Uncomment the commented lines in these sections.

Release Notes

2.4.0

#164. Fixed issue with extra text nodes in odt/ott extraction.
#156. Introduced preserveOnlyMultipleLineBreaks feature.
#149. RTF extraction error fixed by #166.
#145. Handling Japanese full-width characters.
#106. Now extracting .epub

2.3.0

#149. Fixed a few text errors that had cropped up with previous PRs/library updates
#139. Updated mime and marked libraries because of GitHub vulnerability warnings
#137. Added ability to capture HTML alt text via includeAltText option.

2.2.0

#118. Properly extracting horizontal bar character
#119. Passing exec options into RTF extraction.
#119. Preserving № character.
#122. Passing exec options into DOC extraction.
#123. Adding ATOM and RSS extraction.
#128. Handle line break preservation properly in .docx extractor

2.1.2

#114. Not stripping Microsoft dashes.
#116. Better handling image binary check.

2.1.1

#111. Callback was being called two times when URL errored out.
#112. PR added handling errors returned by decoding text files.

2.1.0

Updated all dependencies to latest, except for got, which was updated, but not to the latest because of lack of support for older node versions.
#93. PR added better error handling for fromUrl requests.
#95. PR added support for monetary symbols.
#96. Fixed various issues with doc handling on Windows.
#97, #102. Added ability to provide raw node.js URL object to the fromUrl call which bypasses URL parsing/mangling.
#98. PR shortened needlessly long file paths for temp files.
#99. Now handling Chinese comma.
#101. PR added UTF-8 support for antiword requests.
#105. Added tesseract.cmd option which allows for providing an exact tesseract command-line string.
#109. Properly handle RTF files with spaces in the name on OSX

2.0.0

Codebase is now properly eslinted.
Fixed testing issue: .csv was .gitignored, preventing the .csv test file from making it into the repo.
#57, #75. Added a pdftotextOptions in textract options. This is a proxy to the pdf-text-extract options.
#69. Escaping paths for all exec and spawn.
#74. PR fixing fancy double quotes -> “.
#77. PR fixes decoding of non-utf8 encoded files.
#78. Force all mime types to lowercase for comparison.
#81. Moved .doc (old MSWord) extraction to antiword from catdoc. catdoc is no longer supported on OSX making it extremely difficult for me to support updates that require testing of .doc files. One major difference that'll be seen with .docs of certain types is explained here. If "I'm afraid the text stream of this file is too small to handle." is an error message you see, see that post.
#82, #83. PR updated cheerio to fix a cheerio regression.
Fixed regression issue with above two PRs in combination. Pure text/* extraction left encoded characters for stylized quotes and true ellipsis in the text.
#88. PR fixed detection/messaging of missing binaries for .doc, images and .pdf.
#89. PR returned textract to using j as a module rather than a binary.
#90. PR improved content type detection when extracting from URLs. Also updated tests to pull test files using proper content-type.

1.2.1

#68. PR captured unzip errors.

1.2.0

#66. textract will no longer put the info text to stdout about the extractors not being available or installed correctly. Instead, if you attempt to use a supported extractor that did not initialize correctly, you will get an updated error message indicating that the type is supported by textract but that external dependencies were not located. As part of this update, error messages were updated a bit to list both the type and the file.
#65. Fixed issue where .odt and .docx files with varying non-Latin characters (ex: Cyrillic) were being stripped entirely of their content.

1.1.2

#63. PR added support for CSV.

1.1.1

#58/#59. PR fixed issue with removing line breaks when more than 1 break present.

1.1.0

#53. Cleared up documentation around CLI and line breaks.
#54. PR removed disableCatdocWordWrap as an option, instead always disabling catdoc's word wrapping.
#55. PR removed clobbering of non-boolean flags on CLI.

1.0.4

#52. PR fixed CLI post big API changes.

1.0.3

#51. Fixed issue with large files using unzip returning blank string.

1.0.1/1.0.2

#49 Updated messages when extractors are not available to be purely informational, since textract will work just fine without some of its extractors.
#50. Updated way in which catdoc was detected to not rely on file being test extracted.

1.0.0

Overhaul of interface. To simplify the code, the original textract function was broken into textract.fromFileWithPath and textract.fromFileWithMimeAndPath.
#41. Added support for pulling files from a URL.
#40. Added support for extracting text from a node Buffer. This prevents you from having to write the file to disk first. textract does have to write the file to disk itself, but because it is a textract requirement that files be on disk textract should be able to take care of that for you. Two new functions, textract.fromBufferWithName and textract.fromBufferWithMime have been added. textract needs to either know the file name or the mime type to extract a buffer.
Added entity decoding, so encoded items like &lt;, &gt;, &quot;, &apos;, and &amp; will show up appropriately in the text.
Removed external dependency on unzip
#38. Added markdown support.
#31. Added initial ODT support. Feedback needed if there is any trouble. Also added OTT support.
Added support for ODS, OTS.
Added support for XML, XSL.
Added support for POTX.
Added support for XLTX, XLTS.
Added support for ODG, OTG.
Added support for ODP, OTP.

0.20.0

Pull Request #39 added support for not word wrapping with catdoc.

0.19.0

#30, #34. The command line has been improved, allowing for all the configuration options to be provided.

0.18.0

#36 Fixed error with previous deploy.
#32 Fixed docx line break issue.

0.17.0

Updated character stripping regex to be more lenient.

0.16.0

Added HTML extraction.
Added ability for extractors to register for specific extensions (not yet used). This handles cases where extensions (like .webarchive) do not have recognized mime types.

0.15.0

Addressed some lingering regex issues from previous release.
Added tests for RTF, more tests for DOC
#29 Introduced new extractor for .doc and .rtf for OSX only. All non-OSX operating systems will continue to use catdoc. Going forward, because of issues getting catdoc installed on OSX, on OSX only textutil will be used. textutil comes default installed with OSX.

0.14.0

#29 which resulted in the following changes:

writing info messages to stderr when extractors are taking a while to get going
no longer removing …
centralized some cleansing regexes, also no longer removing multiple back to back spaces using \s as it was removing any back to back newlines. Now scoping back to back replacing to [\t\v\u00A0].

0.13.2

#27, addressed issues with page ordering in pptx extraction.

0.13.1

#25, added language support for tesseract, see tesseract.lang property.
Updated regex that strips bad characters to not strip (some) Chinese characters. The regex will likely need updating by someone more familiar with Chinese. =)

0.13.0

#26, using os.tmpdir() rather than a temp dir inside textract.
Upgraded to latest j (dependency)
Removed macProcessGif option and tests as tesseract seems to work on Mac just fine now

0.12.0

#21, #22. Now using j via its binaries rather than using it via node. This makes XLS/X extraction slower, but reduces memory consumption of textract significantly.

0.11.2

Updated pdf-text-extract to latest, fixes #20.

0.11.1

Addressed path escaping issues with tesseract, fixes [#18](https://github.com/dbashford/textract/issues/18)

0.11.0

Using j to handle xls and xlsx, this removes the requirement on the xls2csv binary.
j also supports xlsb and xlsm