@dmitryrechkin/text-extractor

v1.0.1

Downloads

26

Text Extractor

Text Extractor is a TypeScript library designed for extracting text from various file formats including DOCX, PDF, and images. This library provides a unified interface to handle different file types, making it easier to retrieve text content regardless of the format.

Installation

Install the package using pnpm:

pnpm add @dmitryrechkin/text-extractor

Features

  • DOCX Support: Extracts raw text from DOCX files using `mammoth`.
  • PDF Support: Extracts text from PDF files using `pdfjs-dist`.
  • Image OCR: Extracts text from images using the OCR.space API.
  • Automatic Format Detection: Automatically selects the appropriate extractor based on the input file.
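
The README does not say how the appropriate extractor is chosen. As a rough illustration only, one common way to detect a format from raw bytes is to compare the start of the file against well-known signatures ("magic bytes"). The `detectFormat` helper and the `DetectedFormat` type below are hypothetical names for this sketch and are not part of the package; a ZIP signature only suggests a DOCX file, since other formats also use ZIP containers.

```typescript
// Hypothetical helper for illustration; not part of @dmitryrechkin/text-extractor.
type DetectedFormat = "pdf" | "docx-candidate" | "png" | "jpeg" | "unknown";

// Well-known file signatures found at offset 0.
const SIGNATURES: ReadonlyArray<{ format: DetectedFormat; bytes: number[] }> = [
    { format: "pdf", bytes: [0x25, 0x50, 0x44, 0x46] },            // "%PDF"
    { format: "docx-candidate", bytes: [0x50, 0x4b, 0x03, 0x04] }, // ZIP local file header "PK\x03\x04"
    { format: "png", bytes: [0x89, 0x50, 0x4e, 0x47] },            // "\x89PNG"
    { format: "jpeg", bytes: [0xff, 0xd8, 0xff] }                  // JPEG start-of-image marker
];

// Returns the first format whose signature matches the start of the data.
function detectFormat(data: Uint8Array): DetectedFormat {
    for (const { format, bytes } of SIGNATURES) {
        if (data.length >= bytes.length && bytes.every((b, i) => data[i] === b)) {
            return format;
        }
    }
    return "unknown";
}

console.log(detectFormat(new Uint8Array([0x25, 0x50, 0x44, 0x46, 0x2d]))); // prints "pdf"
```

A check like this could run before any extractor is invoked, so that unsupported input is rejected early.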

Usage

Extracting Text from a DOCX File

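import { readFile } from "node:fs/promises";
import { DocTextExtractor } from "@dmitryrechkin/text-extractor";

// Placeholder path; this assumes extractText accepts a Node.js Buffer.
const docxBuffer = await readFile("document.docx");

const docxExtractor = new DocTextExtractor();
const text = await docxExtractor.extractText(docxBuffer);

console.log(text);
// Output: "Extracted text from the DOCX file..."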

Extracting Text from a PDF File

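import { readFile } from "node:fs/promises";
import { PDFTextExtractor } from "@dmitryrechkin/text-extractor";

// Placeholder path; this assumes extractText accepts a Node.js Buffer.
const pdfBuffer = await readFile("document.pdf");

const pdfExtractor = new PDFTextExtractor();
const text = await pdfExtractor.extractText(pdfBuffer);

console.log(text);
// Output: "Extracted text from the PDF file..."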

Extracting Text from an Image

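import { readFile } from "node:fs/promises";
import { ImageTextExtractor } from "@dmitryrechkin/text-extractor";

// OCR.space credentials and the language to recognize.
const ocrOptions = {
    apiKey: "your-ocr-space-api-key",
    language: "eng"
};

// Placeholder path; this assumes extractText accepts a Node.js Buffer.
const imageBuffer = await readFile("scan.png");

const imageExtractor = new ImageTextExtractor(ocrOptions);
const text = await imageExtractor.extractText(imageBuffer);

console.log(text);
// Output: "Extracted text from the image..."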

Using the Text Extractors Manager

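import { readFile } from "node:fs/promises";
import { TextExtractorsManager, DocTextExtractor, PDFTextExtractor, ImageTextExtractor } from "@dmitryrechkin/text-extractor";

// Register one extractor per supported format; the manager selects the appropriate one for each input.
const manager = new TextExtractorsManager([
    new DocTextExtractor(),
    new PDFTextExtractor(),
    new ImageTextExtractor({ apiKey: "your-ocr-space-api-key" })
]);

// Placeholder path; this assumes extractText accepts a Node.js Buffer.
const fileBuffer = await readFile("input-file");

const text = await manager.extractText(fileBuffer);

console.log(text);
// Output: "Extracted text from the file..."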
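
To make the manager idea concrete, here is a minimal sketch of how a dispatcher over several extractors could be structured. It is not the library's implementation: the `SketchExtractor` interface, its `canHandle` method, the `SketchManager` class, and the stub extractors are all assumptions invented for this illustration, and the real `TextExtractorsManager` may work differently.

```typescript
// Hypothetical interface for illustration; the library's real types may differ.
interface SketchExtractor {
    canHandle(data: Uint8Array): boolean;
    extractText(data: Uint8Array): Promise<string>;
}

// Picks the first registered extractor that claims the input.
class SketchManager {
    private readonly extractors: SketchExtractor[];

    constructor(extractors: SketchExtractor[]) {
        this.extractors = extractors;
    }

    async extractText(data: Uint8Array): Promise<string> {
        const extractor = this.extractors.find((e) => e.canHandle(data));
        if (!extractor) {
            throw new Error("No extractor can handle this input");
        }
        return extractor.extractText(data);
    }
}

// True when the data begins with the given byte sequence.
const hasPrefix = (data: Uint8Array, prefix: number[]): boolean =>
    data.length >= prefix.length && prefix.every((b, i) => data[i] === b);

// Stub extractors that return fixed strings instead of parsing real files.
const pdfStub: SketchExtractor = {
    canHandle: (data) => hasPrefix(data, [0x25, 0x50, 0x44, 0x46]), // "%PDF"
    extractText: async () => "text from pdf stub"
};
const zipStub: SketchExtractor = {
    canHandle: (data) => hasPrefix(data, [0x50, 0x4b, 0x03, 0x04]), // ZIP header used by DOCX
    extractText: async () => "text from docx stub"
};

const sketchManager = new SketchManager([pdfStub, zipStub]);

sketchManager
    .extractText(new Uint8Array([0x25, 0x50, 0x44, 0x46]))
    .then((text) => console.log(text)); // prints "text from pdf stub"
```

In this sketch the order of registration matters, because the first extractor that accepts the input wins.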

When to Use

This library is ideal for projects that require text extraction from various document formats, such as:

  • Document Processing Pipelines: Automatically extract text from documents for indexing, search, or further processing.
  • OCR Tasks: Convert images into text through the OCR.space API.
  • Unified Text Extraction: Manage multiple file formats with a single, unified interface.
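
For the document-processing pipeline case, a batch step might look roughly like the sketch below. It assumes only that an extractor or manager exposes an async `extractText` method, as in the usage examples. The `TextSource`, `NamedFile`, and `BatchResult` types, the `extractBatch` function, and the `demoSource` stub are hypothetical names introduced here. Each file is processed independently, so one failure does not abort the batch.

```typescript
// Minimal shape assumed for any extractor or manager (hypothetical type name).
interface TextSource {
    extractText(data: Uint8Array): Promise<string>;
}

interface NamedFile {
    name: string;
    data: Uint8Array;
}

interface BatchResult {
    name: string;
    text?: string;  // set when extraction succeeded
    error?: string; // set when extraction failed
}

// Extracts text from every file, recording failures instead of throwing.
async function extractBatch(source: TextSource, files: NamedFile[]): Promise<BatchResult[]> {
    return Promise.all(
        files.map(async (file): Promise<BatchResult> => {
            try {
                return { name: file.name, text: await source.extractText(file.data) };
            } catch (err) {
                return { name: file.name, error: err instanceof Error ? err.message : String(err) };
            }
        })
    );
}

// Stub source standing in for a real extractor: rejects empty input.
const demoSource: TextSource = {
    extractText: async (data) => {
        if (data.length === 0) {
            throw new Error("empty input");
        }
        return `extracted ${data.length} bytes`;
    }
};

extractBatch(demoSource, [
    { name: "a.pdf", data: new Uint8Array([1, 2, 3]) },
    { name: "empty.docx", data: new Uint8Array([]) }
]).then((results) => console.log(results));
```

The results array keeps the input order, so each entry can be matched back to its source file before indexing or further processing.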

Installation & Setup

Make sure your project is configured for TypeScript and ES modules. The usage examples rely on `import` syntax and top-level `await`, so they need an environment that supports both.

Contributing

Contributions are welcome! Feel free to fork this project and submit pull requests. Before submitting, please ensure your code passes all linting and unit tests.

You can run unit tests using:

pnpm test