uniparser

v1.3.2

Published

a month ago

A universal parser for PDF, DOCX, TXT, MD, and HTML files into text


amkra

parser pdf docx txt html markdown file-parser text-extractor

📜 UniParser: Universal File Parsing for Node.js

UniParser is a lightweight Node.js library that parses PDF, DOCX, TXT, HTML, and Markdown files and converts them to plain text.

🚀 Say goodbye to file format limitations! UniParser extracts text content from all these formats, providing a consistent text output for your applications.

✨ Features

🔍 PDF Parsing: Extracts plain text from PDF documents.
📝 DOCX Parsing: Reads and extracts text from Microsoft Word .docx files.
📄 TXT Parsing: Handles plain text files with no special formatting.
🌐 HTML Parsing: Extracts text from the body of HTML documents.
🎨 Markdown Parsing: Converts Markdown files to plain text, stripping out all formatting syntax.
🔄 Auto-detection: Automatically detects the file format and parses it using the autoParse function.
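To make the auto-detection idea concrete, here is a minimal sketch of extension-based format detection. It is an illustration only: the README does not say how autoParse identifies a format, and the names `FORMATS_BY_EXTENSION` and `detectFormat` are hypothetical, not part of UniParser's API.

```javascript
const path = require('node:path');

// Hypothetical extension-to-format table. UniParser's actual detection
// logic is not documented in this README and may work differently.
const FORMATS_BY_EXTENSION = {
  '.pdf': 'pdf',
  '.docx': 'docx',
  '.txt': 'txt',
  '.html': 'html',
  '.md': 'markdown',
};

// Returns a format name for a file path, or null when the extension
// is not one of the supported formats. Matching is case-insensitive.
function detectFormat(filePath) {
  const extension = path.extname(filePath).toLowerCase();
  return FORMATS_BY_EXTENSION[extension] ?? null;
}

console.log(detectFormat('./reports/Q3.PDF')); // pdf
console.log(detectFormat('./notes/todo.md'));  // markdown
console.log(detectFormat('./archive.zip'));    // null
```

A check like this can be useful before calling a parser, for example to skip unsupported files in a directory listing.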

📦 Installation

To install UniParser, simply run:

npm install uniparser

🛠️ Usage

CommonJS (CJS) Example

If you’re working in a Node.js environment with CommonJS (CJS), use require() to import UniParser:

const { autoParse, parsePDF, parseDOCX, parseTXT, parseHTML, parseMarkdown } = require('uniparser');

// Example: Automatically detect and parse a file
(async () => {
    const parsedText = await autoParse('./path/to/sample-file.pdf');
    console.log(parsedText);
})();

// Example: Parse specific file types
// CommonJS has no top-level await, so the awaited calls run inside an async function
(async () => {
    const pdfText = await parsePDF('./path/to/sample-file.pdf');
    const docxText = await parseDOCX('./path/to/sample-file.docx');
    const txtText = parseTXT('./path/to/sample-file.txt');
    const htmlText = parseHTML('./path/to/sample-file.html');
    const markdownText = parseMarkdown('./path/to/sample-file.md');
})();

ES Modules (ESM) Example

If you’re working in an ES Module environment (modern JavaScript), use import to load the functions:

import { autoParse, parsePDF, parseDOCX, parseTXT, parseHTML, parseMarkdown } from 'uniparser';

// Example: Automatically detect and parse a file
(async () => {
    const parsedText = await autoParse('./path/to/sample-file.pdf');
    console.log(parsedText);
})();

// Example: Parse specific file types
const pdfText = await parsePDF('./path/to/sample-file.pdf');
const docxText = await parseDOCX('./path/to/sample-file.docx');
const txtText = parseTXT('./path/to/sample-file.txt');
const htmlText = parseHTML('./path/to/sample-file.html');
const markdownText = parseMarkdown('./path/to/sample-file.md');
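When several files need parsing at once, it can help to run the calls concurrently and keep one failure from discarding the other results. The sketch below is one possible approach, not something UniParser provides: `parseAll` is a hypothetical helper that takes the parser as an argument, so autoParse or any of the format-specific functions could be passed in. It assumes the parser throws or rejects on failure, which the README does not state.

```javascript
// Parses several files concurrently and separates successes from failures.
// `parse` is any function that takes a path and returns text or a promise
// of text. Passing UniParser's autoParse here is an assumption about usage.
async function parseAll(filePaths, parse) {
  // Wrapping the call in an async arrow turns synchronous throws into
  // rejections, so Promise.allSettled records them instead of failing.
  const outcomes = await Promise.allSettled(
    filePaths.map(async (filePath) => parse(filePath))
  );

  const texts = {};
  const errors = {};
  outcomes.forEach((outcome, index) => {
    const filePath = filePaths[index];
    if (outcome.status === 'fulfilled') {
      texts[filePath] = outcome.value;
    } else {
      errors[filePath] = outcome.reason;
    }
  });
  return { texts, errors };
}

// Possible usage, with uniparser installed:
// const { autoParse } = require('uniparser');
// parseAll(['./a.pdf', './b.docx'], autoParse).then(({ texts, errors }) => {
//   console.log(Object.keys(texts), Object.keys(errors));
// });
```

Taking the parser as a parameter keeps the helper independent of the library, which also makes it easy to test with a stub.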

⚡ Synchronous Usage (for small files)

The examples below call parseTXT and parseMarkdown without await. A synchronous call blocks the event loop until the file has been read, so reserve this style for small, lightweight files.

CommonJS (CJS):

const { parseTXT, parseMarkdown } = require('uniparser');

// Synchronously read small text files
const txtContent = parseTXT('./path/to/sample-file.txt');
console.log(txtContent);

const markdownContent = parseMarkdown('./path/to/sample-file.md');
console.log(markdownContent);

ES Modules (ESM):

import { parseTXT, parseMarkdown } from 'uniparser';

// Synchronously read small text files
const txtContent = parseTXT('./path/to/sample-file.txt');
console.log(txtContent);

const markdownContent = parseMarkdown('./path/to/sample-file.md');
console.log(markdownContent);
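The README does not define how small "small" is. One way to apply the advice is to check the file size before choosing the synchronous path, as in the sketch below. The 1 MiB threshold and the `isSmallFile` helper are hypothetical and should be tuned to your workload; neither comes from UniParser.

```javascript
const fs = require('node:fs');

// Hypothetical threshold, not a UniParser recommendation.
const SMALL_FILE_LIMIT_BYTES = 1024 * 1024; // 1 MiB

// Returns true when the file at filePath is at most limitBytes in size.
// fs.statSync is itself synchronous, but it only reads file metadata.
function isSmallFile(filePath, limitBytes = SMALL_FILE_LIMIT_BYTES) {
  return fs.statSync(filePath).size <= limitBytes;
}

// Possible usage, with uniparser installed:
// const { parseTXT } = require('uniparser');
// if (isSmallFile('./path/to/sample-file.txt')) {
//   console.log(parseTXT('./path/to/sample-file.txt'));
// }
```

Note that `fs.statSync` throws if the file does not exist, so callers may want to wrap the check in a try/catch.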

🔗 Supported File Formats

📄 PDF (.pdf): Converts PDF documents to plain text.
📝 DOCX (.docx): Extracts text from Microsoft Word .docx files.
🖋️ TXT (.txt): Reads plain text from simple text files.
🌐 HTML (.html): Strips HTML tags and returns the text content.
✍️ Markdown (.md): Converts Markdown files to plain text, removing all formatting.
🔄 Auto-detection: Detects file types automatically via autoParse and processes them accordingly.

🎯 Example

Here's a quick example to get you started with DOCX parsing:

CommonJS (CJS):

const { parseDOCX } = require('uniparser');

(async () => {
    const docxText = await parseDOCX('./path/to/sample-file.docx');
    console.log(docxText);
})();

ES Modules (ESM):

import { parseDOCX } from 'uniparser';

(async () => {
    const docxText = await parseDOCX('./path/to/sample-file.docx');
    console.log(docxText);
})();

🔑 License

This project is licensed under the MIT License. See the LICENSE file for more information.

🤝 Contributing

Contributions are welcome! If you'd like to improve UniParser, feel free to fork the repository and submit a pull request. We appreciate your feedback and contributions!

💡 UniParser makes it easier to extract content from a wide range of file formats. Try it now and streamline your file processing tasks! 🌟
