
chunk-match v1.1.2 · 325 downloads

🕵️‍♂️ chunk-match

A NodeJS library that semantically chunks text and matches it against a user query using cosine similarity for precise and relevant text retrieval.

Installation

npm install chunk-match

Features

  • Semantic text chunking with configurable options
  • Query matching using cosine similarity
  • Configurable similarity thresholds and chunk sizes
  • Returns chunks sorted by relevance with similarity scores
  • Built on top of semantic-chunking for robust text processing
  • Support for various ONNX embedding models
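The core matching step compares a query embedding against chunk embeddings with cosine similarity. The idea can be sketched in plain JavaScript (the vectors below are toy stand-ins for real model embeddings, not output from this library):

```javascript
// Cosine similarity between two equal-length vectors:
// dot(a, b) / (|a| * |b|), ranging from -1 to 1 (1 = same direction).
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy embeddings: the query points in roughly the same direction as chunk A.
const query = [0.9, 0.1, 0.0];
const chunkA = [0.8, 0.2, 0.1];
const chunkB = [0.0, 0.3, 0.9];

console.log(cosineSimilarity(query, chunkA) > cosineSimilarity(query, chunkB)); // true
```

Because the measure depends only on direction, not magnitude, a short chunk and a long chunk about the same topic score similarly.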

Usage

import { matchChunks } from 'chunk-match';

const documents = [
    {
        document_name: "doc1.txt",
        document_text: "Your document text here..."
    },
    {
        document_name: "doc2.txt",
        document_text: "Another document text..."
    }
];

const query = "What are the key points?";

const options = {
    maxResults: 5,
    minSimilarity: 0.5,
    chunkingOptions: {
        maxTokenSize: 500,
        similarityThreshold: 0.5,
        dynamicThresholdLowerBound: 0.4,
        dynamicThresholdUpperBound: 0.8,
        numSimilaritySentencesLookahead: 3,
        combineChunks: true,
        combineChunksSimilarityThreshold: 0.8,
        onnxEmbeddingModel: "nomic-ai/nomic-embed-text-v1.5",
        dtype: 'q8',
        chunkPrefixDocument: "search_document",
        chunkPrefixQuery: "search_query"
    }
};

const results = await matchChunks(documents, query, options);
console.log(results);

API

matchChunks(documents, query, options)

Parameters

  • documents required (Array): Array of document objects with properties:

    • document_name (string): Name/identifier of the document
    • document_text (string): Text content to be chunked and matched
  • query required (string): The search query to match against documents

  • options optional (Object): Configuration options

    • maxResults (number): Maximum number of results to return (default: 10)
    • minSimilarity (number): Minimum similarity threshold for matches (default: 0.475)
    • chunkingOptions (Object): Options for text chunking
      • maxTokenSize (number): Maximum token size for chunks (default: 500)
      • similarityThreshold (number): Threshold for semantic similarity (default: 0.5)
      • dynamicThresholdLowerBound (number): Lower bound for dynamic thresholding (default: 0.475)
      • dynamicThresholdUpperBound (number): Upper bound for dynamic thresholding (default: 0.8)
      • numSimilaritySentencesLookahead (number): Number of sentences to look ahead (default: 2)
      • combineChunks (boolean): Whether to combine similar chunks (default: true)
      • combineChunksSimilarityThreshold (number): Threshold for combining chunks (default: 0.6)
      • onnxEmbeddingModel (string): ONNX model to use for embeddings (see Models section below) (default: Xenova/all-MiniLM-L6-v2)
      • dtype (string): Precision of the embedding model: fp32, fp16, q8, or q4 (default: fp32)
      • chunkPrefixDocument (string): Prefix for document chunks (for embedding models that support task prefixes) (default: null)
      • chunkPrefixQuery (string): Prefix for query chunk (for embedding models that support task prefixes) (default: null)

📗 For more details on the chunking options, see the semantic-chunking documentation

🚨 Note on Model Loading 🚨

The first time you use a specific embedding model, processing takes longer because the model must be downloaded and cached locally, so please be patient. Subsequent runs are much faster since the cached model is reused.
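The speed-up described above is ordinary caching: pay the load cost once, then reuse the loaded instance. A minimal sketch of the pattern (loadModel here is a hypothetical stand-in for the real download-and-initialize step, not part of this library's API):

```javascript
// Cache loaded models by name so the expensive load happens only once.
const modelCache = new Map();

// Hypothetical stand-in for downloading/initializing an ONNX model.
function loadModel(name) {
  return { name, loadedAt: Date.now() };
}

function getModel(name) {
  if (!modelCache.has(name)) {
    modelCache.set(name, loadModel(name)); // slow path: first use only
  }
  return modelCache.get(name); // fast path: cached instance
}

const first = getModel('Xenova/all-MiniLM-L6-v2');
const second = getModel('Xenova/all-MiniLM-L6-v2');
console.log(first === second); // true: same cached object
```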

Returns

Array of match results, each containing:

  • chunk (string): The matched text chunk
  • document_name (string): Source document name
  • document_id (number): Document identifier
  • chunk_number (number): Chunk sequence number
  • token_length (number): Length in tokens
  • similarity (number): Similarity score (0-1)

Embedding Models

This library supports various ONNX embedding models through the semantic-chunking package. Most models have quantized variants available (selected via the dtype option, e.g. q8), which offer better performance with minimal impact on accuracy.

For a complete list of supported models and their characteristics, see the semantic-chunking embedding models documentation.

onnxEmbeddingModel

  • Type: String
  • Default: Xenova/all-MiniLM-L6-v2
  • Description: Specifies the model used to generate sentence embeddings. Different models may yield different qualities of embeddings, affecting the chunking quality, especially in multilingual contexts.
  • Resource Link: ONNX Embedding Models
    Link to a filtered list of embedding models converted to ONNX library format by Xenova.
    Refer to the Model table below for a list of suggested models and their sizes (choose a multilingual model if you need to chunk text other than English).

dtype

  • Type: String
  • Default: fp32
  • Description: Indicates the precision of the embedding model. Options are fp32, fp16, q8, q4. fp32 is the highest precision but also the largest size and slowest to load. q8 is a good compromise between size and speed if the model supports it. All models support fp32, but only some support fp16, q8, and q4.
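For example, to trade a little precision for a smaller, faster model, you might pair a model with its q8 variant (assuming, per the table below, that the chosen model ships one):

```javascript
// Options fragment: q8 quantization for a model that supports it.
const options = {
  chunkingOptions: {
    onnxEmbeddingModel: 'Xenova/all-MiniLM-L6-v2',
    dtype: 'q8', // smaller and faster than fp32; check model availability first
  },
};
console.log(options.chunkingOptions.dtype);
```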

Curated ONNX Embedding Models

| Model | Precision (dtype) | Link | Size |
| -------------------------------------------- | ----------------- | -------------------------------------------------------------------- | ---------------------- |
| nomic-ai/nomic-embed-text-v1.5 | fp32, q8 | https://huggingface.co/nomic-ai/nomic-embed-text-v1.5 | 548 MB, 138 MB |
| thenlper/gte-base | fp32 | https://huggingface.co/thenlper/gte-base | 436 MB |
| Xenova/all-MiniLM-L6-v2 | fp32, fp16, q8 | https://huggingface.co/Xenova/all-MiniLM-L6-v2 | 23 MB, 45 MB, 90 MB |
| Xenova/paraphrase-multilingual-MiniLM-L12-v2 | fp32, fp16, q8 | https://huggingface.co/Xenova/paraphrase-multilingual-MiniLM-L12-v2 | 470 MB, 235 MB, 118 MB |
| Xenova/all-distilroberta-v1 | fp32, fp16, q8 | https://huggingface.co/Xenova/all-distilroberta-v1 | 326 MB, 163 MB, 82 MB |
| BAAI/bge-base-en-v1.5 | fp32 | https://huggingface.co/BAAI/bge-base-en-v1.5 | 436 MB |
| BAAI/bge-small-en-v1.5 | fp32 | https://huggingface.co/BAAI/bge-small-en-v1.5 | 133 MB |
| yashvardhan7/snowflake-arctic-embed-m-onnx | fp32 | https://huggingface.co/yashvardhan7/snowflake-arctic-embed-m-onnx | 436 MB |

Each of these parameters lets you tune matchChunks to the text size, content complexity, and performance requirements of your application.

Web UI

Check out the webui folder for a web-based interface for experimenting with and tuning Chunk Match settings. It provides a visual way to test and configure the chunk-match library's semantic text matching and find optimal results for your specific use case. Once you've found the best settings, you can generate code to implement them in your project.

[Screenshot: chunk-match web UI]

License

This project is licensed under the MIT License - see the LICENSE file for details.

Appreciation

If you enjoy this library, please consider sending me a tip to support my work 😀

🍵 tip me here