pdftoc
v0.1.1
Pdftoc
Creates Table of Contents in PDF documents
Usage
- Clone this project
- Run
npm run build
- Run tool via
npx pdftoc
Recipes
Recipes describe how to detect headings in the PDF. There are basically two strategies, which may be combined:
- By Font: Headings are often rendered with a unique font and can therefore be detected by font.
- By Pattern: Headings may also be detected by regular expressions, e.g., they often start with numbers like "2.3.1".
The recipe file is a JSON file consisting of an array of recipes. Each recipe may contain the following properties, of which all except level are optional.
- fontName: the name of the font as used in the PDF file
- fontSizeFrom: The minimum size of the font (in points), e.g. 30
- fontSizeTo: The maximum size of the font (in points). If not defined, the value of fontSizeFrom is used here as well
- bold: true or false
- italic: true or false
- fillRGBColor: Array of three numbers with the RGB colors (from 0 to 255), e.g. [200,0,0] for red.
- regexp: Regular expression applied to the text, e.g. "^\d+\.\d+\.\d+\s+.*". Note that backslashes must be escaped in the JSON file (write \\d instead of \d). The value is either a single regex or an array of regexes. In the latter case, the regexes are applied to the first text matching the font filters and to all succeeding texts (without checking the fonts).
- level: The heading level, starting at 1.
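The properties above could be modelled in TypeScript roughly as follows. This is a minimal sketch, not the tool's actual source: the `Recipe` and `TextItem` interfaces and the `matchesFont` helper are hypothetical names, and treating an omitted property as "matches anything" is an assumption derived from the properties being optional. The color and regexp filters are left out to keep the sketch short.

```typescript
// Hypothetical shape of one recipe entry; property names follow the list above.
interface Recipe {
  fontName?: string;
  fontSizeFrom?: number;
  fontSizeTo?: number; // falls back to fontSizeFrom when omitted
  bold?: boolean;
  italic?: boolean;
  fillRGBColor?: [number, number, number];
  regexp?: string | string[];
  level: number;
}

// Hypothetical description of a piece of text extracted from the PDF.
interface TextItem {
  text: string;
  fontName: string;
  fontSize: number;
  bold: boolean;
  italic: boolean;
}

// Returns true if the item passes every font filter the recipe defines.
// Assumption: a filter that is not set in the recipe matches anything.
function matchesFont(recipe: Recipe, item: TextItem): boolean {
  if (recipe.fontName !== undefined && recipe.fontName !== item.fontName) return false;
  if (recipe.fontSizeFrom !== undefined) {
    const to = recipe.fontSizeTo ?? recipe.fontSizeFrom;
    if (item.fontSize < recipe.fontSizeFrom || item.fontSize > to) return false;
  }
  if (recipe.bold !== undefined && recipe.bold !== item.bold) return false;
  if (recipe.italic !== undefined && recipe.italic !== item.italic) return false;
  return true;
}

const h1: Recipe = { fontName: "BHTCaseMicro", fontSizeFrom: 64, bold: true, level: 1 };
const item: TextItem = {
  text: "1 Introduction",
  fontName: "BHTCaseMicro",
  fontSize: 64,
  bold: true,
  italic: false,
};
console.log(matchesFont(h1, item));
```

Because `fontSizeTo` defaults to `fontSizeFrom`, a recipe with only `fontSizeFrom: 64` matches text of exactly 64 points in this sketch.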
Hint: If you use regular expressions, the document's own table of contents may cause every heading to be detected twice. In that case, set pages accordingly, e.g. pages="5-", so that the table of contents pages are ignored.
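To illustrate what a setting like pages="5-" means, here is a small sketch of a page filter. The helper `pageInRange` is hypothetical and not taken from the tool. Only the open-ended form "5-" appears in this README; the other forms handled below ("3-7", "4", "-7") are assumptions added for illustration.

```typescript
// Hypothetical helper: decides whether a 1-based page number falls into a
// range string such as "5-" (page 5 to the end), "3-7", or "4".
// Only the "5-" form is documented here; the other forms are assumptions.
function pageInRange(pages: string, page: number): boolean {
  const match = /^(\d+)?(-)?(\d+)?$/.exec(pages.trim());
  if (!match || (match[1] === undefined && match[3] === undefined)) {
    throw new Error(`Unsupported page range: ${pages}`);
  }
  const from = match[1] !== undefined ? Number(match[1]) : 1;
  // Without a dash the range is a single page;
  // with a dash but no end page it is open-ended.
  const to =
    match[3] !== undefined ? Number(match[3])
    : match[2] !== undefined ? Infinity
    : from;
  return page >= from && page <= to;
}

// Pages 1 to 4 (e.g. the printed table of contents) are skipped.
console.log([3, 4, 5, 6].filter((p) => pageInRange("5-", p)));
```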
Examples
Detect headings based on font:
[
{ "fontName": "BHTCaseMicro", "fontSizeFrom": 64,
"bold": true, "italic": false, "level": 1 },
{ "fontName": "BHTCaseMicro", "fontSizeFrom": 60,
"bold": true, "italic": false, "level": 2 },
{ "fontName": "BHTCaseMicro", "fontSizeFrom": 52,
"bold": true, "italic": false, "level": 3 },
{ "fontName": "BHTCaseMicro", "fontSizeFrom": 48,
"bold": true, "italic": false, "level": 4 }
]
Detect headings based on regular expressions:
[
{ "regexp": "^\\d+\\s+.*", "level": 1 },
{ "regexp": "^\\d+\\.\\d+\\s+.*", "level": 2 },
{ "regexp": "^\\d+\\.\\d+\\.\\d+\\s+.*", "level": 3 }
]
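The following sketch shows why the backslashes are doubled in the recipe file and how the three patterns separate the heading levels. It is an illustration only: `RegexRecipe` and `detectLevel` are hypothetical names, and picking the first matching recipe is an assumption, not necessarily how the tool resolves matches.

```typescript
// The recipe file as it would appear on disk. String.raw keeps the
// backslashes exactly as written, so this is the literal JSON text.
const recipeJson = String.raw`[
  { "regexp": "^\\d+\\s+.*", "level": 1 },
  { "regexp": "^\\d+\\.\\d+\\s+.*", "level": 2 },
  { "regexp": "^\\d+\\.\\d+\\.\\d+\\s+.*", "level": 3 }
]`;

interface RegexRecipe {
  regexp: string;
  level: number;
}

// JSON.parse turns each "\\" into a single backslash, so the first
// pattern becomes ^\d+\s+.* once it reaches the RegExp constructor.
const recipes: RegexRecipe[] = JSON.parse(recipeJson);

// Hypothetical helper: returns the level of the first recipe whose
// pattern matches the text, or undefined if none matches.
function detectLevel(text: string): number | undefined {
  return recipes.find((r) => new RegExp(r.regexp).test(text))?.level;
}

console.log(detectLevel("2 Methods"));     // 1
console.log(detectLevel("2.3 Results"));   // 2
console.log(detectLevel("2.3.1 Details")); // 3
console.log(detectLevel("Plain text"));    // undefined
```

The patterns do not overlap because each one requires whitespace directly after its last number, so "2.3 Results" cannot match the level 1 pattern.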
Tip: Using regular expressions with "bold": true works in many cases :-)
Implementation Notes
PDF.js
- Documentation of operations: https://github.com/MeiKatz/pdfjs-docs
- PDF 2.0 Specification: https://developer.adobe.com/document-services/docs/assets/5b15559b96303194340b99820d3a70fa/PDF_ISO_32000-2.pdf
- PDF 1.7 Specification: https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf
Modules
This package relies on PDF.js, which is added as a dependency (pdfjs-dist). PDF.js uses ES modules, and as a consequence ES modules are used here as well. This causes some problems, in particular with TypeScript:
- TypeScript files either need the file extension .mts or, as done here, "type": "module" must be set in package.json
- TS-Jest works with ES modules, but it is a bit of a hassle. See https://jestjs.io/docs/ecmascript-modules and https://kulshekhar.github.io/ts-jest/docs/guides/esm-support/#support-mts-extension
As noted in the PDF.js FAQ, we need to import the legacy build of PDF.js.
Status
In development. The tool works with some files, but sometimes the font is not detected. To use fonts in the settings, access to the original document with its font information is usually required (as the analysis is not really usable at the moment). Using regexes, however, works quite well even for unknown documents.
Probably need a GUI sooner or later ;-)
License
This program and the accompanying materials are made available under the terms of the Eclipse Public License v. 2.0 which is available at https://www.eclipse.org/legal/epl-2.0.