pdftoc

v0.1.1

Published

5 months ago

Creates Table of Contents in PDF documents

jpilgrim

PDF TOC

Pdftoc

Usage

Clone this project
Run npm run build
Run the tool via npx pdftoc

Recipes

Recipes describe how to detect headings in the PDF. There are two strategies, which may be combined:

By Font: Headings are often rendered with a unique font and can therefore be detected by font.
By Pattern: Headings may also be detected by regular expressions, e.g., they often start with numbers like "2.3.1".

The recipe file is a JSON file consisting of an array of recipes. Each recipe may contain the following properties; all except level are optional.

fontName: the name of the font as used in the PDF file
fontSizeFrom: The minimum size of the font (in points), e.g. 30
fontSizeTo: The maximum size of the font (in points); if not defined, the value of fontSizeFrom is used here as well
bold: true or false
italic: true or false
fillRGBColor: Array of three numbers with the RGB colors (from 0 to 255), e.g. [200,0,0] for red.
regexp: Regular expression applied to the text, e.g. "^\d+\.\d+\.\d+\s+.*". Note that backslashes must be escaped in the JSON file, i.e. this example is written there as "^\\d+\\.\\d+\\.\\d+\\s+.*". The regexp can be either a single regex or an array of regexes. In the latter case, the regexes are applied to the first text matching the font filters and to all succeeding texts (without checking the fonts).
level: The heading level, starting at 1.
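
The properties above can be read as a set of optional filters plus a mandatory level. The following is a minimal sketch of that reading, not the tool's actual implementation: the Recipe and TextRun interfaces, the function names, and the font names in the sample data are hypothetical. It also assumes that, for a regexp array, pattern i is tested against the i-th text following the first match.

```typescript
// Hypothetical shape of one recipe; property names mirror the list above.
interface Recipe {
  fontName?: string;
  fontSizeFrom?: number;
  fontSizeTo?: number; // falls back to fontSizeFrom when omitted
  bold?: boolean;
  italic?: boolean;
  fillRGBColor?: [number, number, number];
  regexp?: string | string[];
  level: number;
}

// Assumed description of one piece of text extracted from the PDF.
interface TextRun {
  text: string;
  fontName: string;
  fontSize: number;
  bold: boolean;
  italic: boolean;
  fillRGBColor: [number, number, number];
}

// True if every font-related filter defined in the recipe accepts the run.
function matchesFont(run: TextRun, r: Recipe): boolean {
  if (r.fontName !== undefined && r.fontName !== run.fontName) return false;
  const from = r.fontSizeFrom ?? -Infinity;
  const to = r.fontSizeTo ?? r.fontSizeFrom ?? Infinity;
  if (run.fontSize < from || run.fontSize > to) return false;
  if (r.bold !== undefined && r.bold !== run.bold) return false;
  if (r.italic !== undefined && r.italic !== run.italic) return false;
  if (r.fillRGBColor !== undefined &&
      !r.fillRGBColor.every((c, i) => c === run.fillRGBColor[i])) return false;
  return true;
}

// Returns the recipe's level if the runs starting at `index` match, else undefined.
function headingLevelAt(runs: TextRun[], index: number, r: Recipe): number | undefined {
  const first = runs[index];
  if (first === undefined || !matchesFont(first, r)) return undefined;
  const patterns =
    r.regexp === undefined ? [] : Array.isArray(r.regexp) ? r.regexp : [r.regexp];
  // Pattern i is tested against run index + i; only the first run had its font checked.
  const ok = patterns.every((p, i) => {
    const run = runs[index + i];
    return run !== undefined && new RegExp(p).test(run.text);
  });
  return ok ? r.level : undefined;
}

// Sample data: a bold section number followed by a non-bold title.
const runs: TextRun[] = [
  { text: "2.3", fontName: "Helvetica-Bold", fontSize: 14, bold: true, italic: false, fillRGBColor: [0, 0, 0] },
  { text: "Parsing", fontName: "Helvetica", fontSize: 14, bold: false, italic: false, fillRGBColor: [0, 0, 0] },
];
const recipe: Recipe = { bold: true, regexp: ["^\\d+\\.\\d+$", "^\\S+"], level: 2 };

console.log(headingLevelAt(runs, 0, recipe)); // 2
console.log(headingLevelAt(runs, 1, recipe)); // undefined, because "Parsing" is not bold
```

In this sketch a recipe without any filter matches every text, so a real recipe would normally define at least one font property or a regexp.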

Hint: If you use regular expressions, the document's own table of contents may match as well, so that all headings are duplicated. In that case, set pages accordingly, e.g. pages="5-", so that the table of contents pages are ignored.
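
To illustrate what an open-ended range such as "5-" selects, here is a small sketch. The helper expandPages is hypothetical and not part of pdftoc; only the form "5-" appears in the hint above, while the forms "N" and "N-M" are assumptions added for completeness.

```typescript
// Hypothetical helper: expands a page specification into 1-based page numbers.
// "N-" means "from page N to the end"; "N" and "N-M" are assumed forms.
function expandPages(spec: string, pageCount: number): number[] {
  const m = /^(\d+)(-(\d*))?$/.exec(spec.trim());
  if (m === null) throw new Error(`Unsupported page specification: ${spec}`);
  const first = Number(m[1]);
  // No dash: a single page. Dash without an end: run to the last page.
  const last = m[2] === undefined ? first : m[3] === "" ? pageCount : Number(m[3]);
  const pages: number[] = [];
  for (let p = Math.max(first, 1); p <= Math.min(last, pageCount); p++) {
    pages.push(p);
  }
  return pages;
}

// For an 8-page document, "5-" skips pages 1 to 4 (e.g. title and table of contents).
console.log(expandPages("5-", 8)); // [ 5, 6, 7, 8 ]
```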

Examples

Detect headings based on font:

[
    { "fontName": "BHTCaseMicro", "fontSizeFrom": 64,
        "bold": true, "italic": false, "level": 1 },
    { "fontName": "BHTCaseMicro", "fontSizeFrom": 60,
        "bold": true, "italic": false, "level": 2 },
    { "fontName": "BHTCaseMicro", "fontSizeFrom": 52,
        "bold": true, "italic": false, "level": 3 },
    { "fontName": "BHTCaseMicro", "fontSizeFrom": 48,
        "bold": true, "italic": false, "level": 4 }
]

Detect headings based on regular expressions:

[
    { "regexp": "^\\d+\\s+.*", "level": 1 },
    { "regexp": "^\\d+\\.\\d+\\s+.*", "level": 2 },
    { "regexp": "^\\d+\\.\\d+\\.\\d+\\s+.*", "level": 3 }
]
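
The following sketch shows why the backslashes are doubled in the file and how the three patterns separate heading levels. It is an illustration only: the RegexRecipe interface, the detectLevel function, and the sample heading texts are hypothetical, and "first matching recipe wins" is an assumption about evaluation order.

```typescript
// The recipe file content as it appears on disk; String.raw keeps the backslashes verbatim.
const recipeFile = String.raw`[
  { "regexp": "^\\d+\\s+.*", "level": 1 },
  { "regexp": "^\\d+\\.\\d+\\s+.*", "level": 2 },
  { "regexp": "^\\d+\\.\\d+\\.\\d+\\s+.*", "level": 3 }
]`;

// Hypothetical type covering only the properties used in this example.
interface RegexRecipe {
  regexp: string;
  level: number;
}

// JSON.parse turns each escaped "\\" into a single backslash.
const recipes = JSON.parse(recipeFile) as RegexRecipe[];

// Returns the level of the first recipe whose pattern matches, or undefined.
function detectLevel(text: string): number | undefined {
  return recipes.find((r) => new RegExp(r.regexp).test(text))?.level;
}

console.log(recipes[0].regexp);           // ^\d+\s+.*
console.log(detectLevel("2 Background")); // 1
console.log(detectLevel("2.3 Recipes"));  // 2
console.log(detectLevel("2.3.1 Fonts"));  // 3
console.log(detectLevel("Introduction")); // undefined
```

Because each pattern requires whitespace directly after its last number, "2.3.1 Fonts" matches neither the level 1 nor the level 2 pattern, so the order of the recipes does not matter for these three.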

Tip: Using regular expressions with "bold": true works in many cases :-)

Implementation Notes

PDF.js

Documentation of operations: https://github.com/MeiKatz/pdfjs-docs
PDF 2.0 Specification: https://developer.adobe.com/document-services/docs/assets/5b15559b96303194340b99820d3a70fa/PDF_ISO_32000-2.pdf
PDF 1.7 Specification: https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf

Modules

This package relies on PDF.js, which is added as a dependency (pdfjs-dist). pdfjs-dist uses ES modules, and as a consequence ES modules are used here as well. This causes some problems, in particular with TypeScript:

TypeScript files either need the file extension .mts or, as done here, "type": "module" must be defined in package.json
TS-Jest works with ES modules, but it is a bit of a hassle. See https://jestjs.io/docs/ecmascript-modules and https://kulshekhar.github.io/ts-jest/docs/guides/esm-support/#support-mts-extension

As noted in the PDF.js FAQ, we need to import the legacy build of PDF.js.

Status

In development. Works with some files; sometimes the font is not detected. In order to use fonts in recipes, access to the original document with its font information is usually required (as the font analysis is not really usable at the moment). Using regular expressions, however, works quite well even for unknown documents.

Probably need a GUI sooner or later ;-)

License

This program and the accompanying materials are made available under the terms of the Eclipse Public License v. 2.0 which is available at https://www.eclipse.org/legal/epl-2.0.
