documentspark

v1.6.2

Published

2 years ago

documentspark :: document -> picture_of_page[]

Downloads

0High
0Medium
0Low

dosy

pdf image

:sparkling_heart: DocumentSpark

Simple secure document viewing server. Used by Viewfinder™

Converts a document to a picture of its pages. View a document from the internet without downloading or running it on your machine, and without needing a word processor, spreadsheet app, or PDF viewer installed. This provides content disarm and reconstruction, or CDR. Also known as p2., this code is deployed commercially by Dosyago in their ViewFinder cloud browser product.

From the comments

This is a very simple server in NodeJS to accept a document upload (or a URL) and convert that document (using ImageMagick, LibreOffice and GhostScript) into a series of images, one for each page of the document.
The point was originally to allow people to view documents securely (such as email attachments) without needing to run nor download said document to their own devices. It was successful in doing that, but its use grew to becoming ad-hoc document hosting where people were attracted to the ability to access a page of a document, without needing to download the entire document.
The code is shared as something you can build upon and adapt to your uses in the open. It's not meant as a finished solution, it's meant as a starting point, something to give you ideas for how to implement your own version, or something to plug in to your own open-source work. The project was originally called "p2." for "PDF to ...", but it works on a wide range of source documents, including DOCX and (often but not always) XLSX, and so on. It doesn't work on HTML or TXT.

Use it

$ git clone https://github.com/dosyago/documentspark.git
$ cd documentspark
$ ./scripts/setup.sh 
$ ./scripts/restart.sh

Or:

$ npm i documentspark@latest
$ cd node_modules/documentspark
$ ./scripts/setup.sh 
$ ./scripts/restart.sh

If you have SSL certs in $HOME/sslcerts/ these will be used (including mkcert localhost certs!), if not the server will run on HTTP. It will run under pm2 and default to port 443. You can supply a custom port with npm start <PORT>.

Navigate to yourserver:your_port/secretpage-canneverbefound.html to convert a document. You can input either a file, or a URL. It may not always be possible to obtain a document from the URL.

Document view pages are not protected by any authentication, they are simply chosen pseudo-randomly. You can modify the code to give document viewing pages longer, more securely random URLs.

By default, converted documents are cleaned out after 3 days. You can change this in /public/uploads/clean.sh which runs every few minutes and cleans any documents older than 4319 minutes (roughly 3 days).

Make it an API

There's a very simple "master key" secret parameter sent with the POST request. You can call this POST endpoint via a secure HTTPS API (using multitype/form encoding) and pass your custom secret= as a parameter to authorize the conversion.

System Requirements

You need a beefy machine. 4 cores, with 8 GB RAM for most documents. But more is better. Smaller machines will routinely run out of memory or take a long time when running the libreoffice, imagemagick and gs jobs.

Improving perf

You can try recompiling ImageMagick to have multicore support. I found this significantly improves performance.

Thanks to*

*No affiliation

License

Licensed under AGPL-3.0.

If you'd like to deploy this in your org without going open-source or for a for-profit project where youd want to release the source under AGPL-3.0 as well, write me ([email protected]) about a license exemption.

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme