
line-segmentation-gcp-vision-ocr

v1.0.0

Line segmentation algorithm for GCP Vision OCR.

Downloads

162

Readme

Introduction

Google Vision OCR (optical character recognition) can detect and extract text from images. Two annotation features support optical character recognition:

  • TEXT_DETECTION detects and extracts text from any image.
  • DOCUMENT_TEXT_DETECTION also extracts text from an image, but the response is optimized for dense text and documents. Both options are suitable for extracting data from articles and dense text, but the second option, DOCUMENT_TEXT_DETECTION, additionally applies a segmentation method that merges nearby words into lines and paragraphs.
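
To make the difference between the two features concrete, here is a minimal sketch of how the feature type is selected in the JSON body of a Vision `images:annotate` REST request. The helper name `buildAnnotateRequest` and the placeholder image string are assumptions made for illustration; they are not part of this package.

```javascript
// Hypothetical helper (not part of this package): builds the JSON body for a
// Vision `images:annotate` request using one of the two OCR feature types.
function buildAnnotateRequest(base64Image, featureType) {
  const allowed = ["TEXT_DETECTION", "DOCUMENT_TEXT_DETECTION"];
  if (!allowed.includes(featureType)) {
    throw new Error(`Unsupported OCR feature: ${featureType}`);
  }
  return {
    requests: [
      {
        image: { content: base64Image }, // base64-encoded image bytes
        features: [{ type: featureType }],
      },
    ],
  };
}

// "BASE64_IMAGE_BYTES" is a placeholder, not real image data.
const body = buildAnnotateRequest("BASE64_IMAGE_BYTES", "DOCUMENT_TEXT_DETECTION");
console.log(JSON.stringify(body, null, 2));
```

Switching the second argument to "TEXT_DETECTION" is the only change needed to request the other feature.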

The segmentation in DOCUMENT_TEXT_DETECTION is meant to intelligently merge the content into paragraphs and blocks of text. However, the algorithm behind the OCR does a poor job when you need to extract text from a document image that contains gaps, for example the whitespace between a product's title and its value.

[Image: sample OCR output for a document which should be sectioned]

This behaviour creates a real problem for information extraction. I ran into exactly this case and had to find a solution that produces adequate output.

Explanation

This algorithm takes a different approach to processing the raw data from the GCP Vision response, interpreting the position of the text on the page differently. The phases of the algorithm are:

  • Merge words/characters which are very close: the first stage concatenates nearby characters, whose bounding polygons almost overlap, into words and sentences. This phase reduces the computation needed for the next steps.

  • Create bounding polygons: the second stage builds an imaginary coordinate system with each word/sentence enclosed in a polygon. [Image: words and sentences enclosed in bounding polygons]

  • Combine bounding polygons: the third stage parses the data and aligns the elements. The algorithm tries to fit words into single lines, creating a bigger polygon for each line. [Image: one larger polygon per line]

  • Construct lines: in the final stage, the algorithm merges the words/sentences into single lines and returns the result as an array containing each line.
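
The phases above can be illustrated with a much simplified sketch. It is not the package's implementation: it reduces every polygon to an axis-aligned box and puts a word on a line when its vertical center falls inside that line's vertical extent. The input shape (`{ text, vertices }`), the function names and the sample data are all assumptions made for this example.

```javascript
// Simplified sketch, not the package's code. Each word is assumed to look like
// { text, vertices: [{ x, y }, ...] } with four corner points.

// Reduce a polygon to an axis-aligned bounding box.
function toBox(word) {
  const xs = word.vertices.map((v) => v.x);
  const ys = word.vertices.map((v) => v.y);
  return {
    text: word.text,
    minX: Math.min(...xs),
    maxX: Math.max(...xs),
    minY: Math.min(...ys),
    maxY: Math.max(...ys),
  };
}

// Group boxes into lines: a box joins a line when its vertical center falls
// inside the line's current vertical extent, which then grows to include it.
function segmentLines(words) {
  const boxes = words.map(toBox).sort((a, b) => a.minY - b.minY);
  const lines = [];
  for (const box of boxes) {
    const centerY = (box.minY + box.maxY) / 2;
    const line = lines.find((l) => centerY >= l.minY && centerY <= l.maxY);
    if (line) {
      line.boxes.push(box);
      line.minY = Math.min(line.minY, box.minY);
      line.maxY = Math.max(line.maxY, box.maxY);
    } else {
      lines.push({ minY: box.minY, maxY: box.maxY, boxes: [box] });
    }
  }
  // Read each line left to right and join its words with a space.
  return lines.map((l) =>
    l.boxes
      .sort((a, b) => a.minX - b.minX)
      .map((b) => b.text)
      .join(" ")
  );
}

// Helper for building made-up rectangular polygons.
const rect = (x, y, w, h) => [
  { x, y },
  { x: x + w, y },
  { x: x + w, y: y + h },
  { x, y: y + h },
];

// Made-up receipt-like data with a wide gap between titles and values.
const sampleWords = [
  { text: "Total", vertices: rect(10, 52, 40, 10) },
  { text: "Coffee", vertices: rect(10, 20, 50, 10) },
  { text: "3.50", vertices: rect(300, 21, 30, 10) },
  { text: "7.00", vertices: rect(300, 50, 30, 10) },
];

console.log(segmentLines(sampleWords)); // [ 'Coffee 3.50', 'Total 7.00' ]
```

Even though each title and its value are far apart horizontally, they end up on the same line because only the vertical position decides the grouping.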

Extra feature

Besides the data parsed from GCP, you can pass a second parameter to the algorithm: an array of words/sentences which you want to get rid of. These strings are excluded from the content and are not processed any further.

Optical character recognition isn't perfect, but if you include a long sentence, it will most likely still be deleted properly. We use js-levenshtein, which tolerates some mistakes when searching for your string, to match the desired string even if GCP Vision OCR made mistakes while reading your text from the image.
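
As an illustration of this kind of tolerant matching, the sketch below implements the Levenshtein edit distance by hand and uses it to drop strings that are close to an unwanted string. The package itself relies on js-levenshtein; the filter function `removeUnwanted`, the 20% tolerance and the sample strings are assumptions made for this example, not the package's actual logic.

```javascript
// Classic dynamic-programming edit distance, keeping only two rows in memory.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1, // deletion
        curr[j - 1] + 1, // insertion
        prev[j - 1] + cost // substitution or match
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical filter: drop OCR strings whose distance to any unwanted string
// is at most `maxRatio` of that unwanted string's length (an assumed tolerance).
function removeUnwanted(strings, unwanted, maxRatio = 0.2) {
  return strings.filter(
    (s) =>
      !unwanted.some(
        (u) =>
          levenshtein(s.toLowerCase(), u.toLowerCase()) <=
          Math.floor(u.length * maxRatio)
      )
  );
}

// Made-up OCR output: the first string contains a misread word ("shoping").
const ocrStrings = ["Thank you for shoping with us", "Coffee 3.50"];
const cleaned = removeUnwanted(ocrStrings, ["Thank you for shopping with us"]);
console.log(cleaned); // [ 'Coffee 3.50' ]
```

A tolerance relative to the length of the unwanted string is one reason longer sentences are easier to match safely: they can absorb more OCR mistakes without being confused with unrelated text.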

Usage Guide

Installation

Use the following command to install the package:

npm i line-segmentation-gcp-vision-ocr

Usage

All you need to do is pass the data from GCP Vision OCR to the package's init function. The function returns the content processed and segmented into lines. The data can be provided as JSON or taken directly from the GCP response. Here is a simple usage example:

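// Require the package under the name used in the install command above.
const segmentation = require("line-segmentation-gcp-vision-ocr");

// Call GCP Vision and store the result here. The empty object is only a
// placeholder and must be replaced with a real response before calling init.
const gcpResponse = {};
const segmentedResult = segmentation.init(gcpResponse[0]["responses"][0]);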
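
The indexing in the example fails with an unhelpful error when the response is empty or has a different shape. A small guard, sketched here under the assumption that the response looks like `[{ responses: [ ... ] }]`, makes that failure explicit. The helper `extractFirstResponse` and the mock data are hypothetical and not part of the package.

```javascript
// Hypothetical helper, not part of the package: returns the element that the
// usage example passes to `init`, assuming a response shaped like
// [{ responses: [ ... ] }].
function extractFirstResponse(gcpResponse) {
  const first = Array.isArray(gcpResponse) ? gcpResponse[0] : undefined;
  const responses = first && first.responses;
  if (!Array.isArray(responses) || responses.length === 0) {
    throw new TypeError(
      "Expected a GCP Vision response shaped like [{ responses: [...] }]"
    );
  }
  return responses[0];
}

// Made-up data, for illustration only.
const mockResponse = [
  { responses: [{ textAnnotations: [{ description: "Coffee 3.50" }] }] },
];
const firstResponse = extractFirstResponse(mockResponse);
console.log(firstResponse.textAnnotations[0].description);
```

The returned value can then be handed to the init function in place of the direct indexing.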

Issues

Currently, the algorithm only works on horizontally oriented scanned documents.

Future Work

  • Detecting the text orientation inside the package, so that it can segment any text found in the image, in any orientation.
  • Extending the bounding calculations to take all angles into account, so that the algorithm works on any image, not only on straight photos/scans of documents.
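
One possible starting point for the orientation work, offered only as a sketch and not as something the package does today, is to estimate the text angle from the top edge of a bounding polygon and rotate the points back to horizontal. It assumes the vertices are ordered top-left, top-right, bottom-right, bottom-left relative to the text; the function names are hypothetical.

```javascript
// Angle of a polygon's top edge in degrees. Assumes the vertices are ordered
// top-left, top-right, bottom-right, bottom-left relative to the text.
function textAngleDegrees(vertices) {
  const [topLeft, topRight] = vertices;
  const radians = Math.atan2(topRight.y - topLeft.y, topRight.x - topLeft.x);
  return (radians * 180) / Math.PI;
}

// Rotate a point around the origin by the negative of the detected angle,
// so that text detected at `angleDegrees` becomes horizontal.
function deskewPoint(point, angleDegrees) {
  const r = (-angleDegrees * Math.PI) / 180;
  return {
    x: point.x * Math.cos(r) - point.y * Math.sin(r),
    y: point.x * Math.sin(r) + point.y * Math.cos(r),
  };
}

// Made-up polygon for text that runs straight down the image.
const rotated = [
  { x: 5, y: 0 },
  { x: 5, y: 10 },
  { x: 0, y: 10 },
  { x: 0, y: 0 },
];
const angle = textAngleDegrees(rotated);
const deskewed = rotated.map((p) => deskewPoint(p, angle));
console.log(angle, deskewed);
```

After deskewing, the first two points share the same y value, so a horizontal line segmentation could be applied to the rotated coordinates.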