vectorizejs

v0.2.1

Automatically generate and sync vector embeddings for PostgreSQL text data

Downloads

168

VectorizeJS

VectorizeJS is a Node.js module that automates the vectorization of text data stored in a PostgreSQL database. It listens for changes in a specified table, chunks the text content, generates embeddings using a user-defined function, and stores the embeddings back into the database. Designed for scalability and robustness, it includes features like asynchronous processing, error handling with retries, concurrency control, and comprehensive logging.

Features

  • Automated Vectorization: Automatically generates embeddings for new or updated text data in your PostgreSQL database.
  • Token-Based Chunking: Splits large texts into manageable chunks based on token count, respecting model token limits.
  • Asynchronous Processing: Processes chunks concurrently with configurable concurrency limits for optimal performance.
  • Error Handling and Retries: Robust error handling with an exponential backoff retry mechanism for transient errors.
  • Comprehensive Logging: Detailed logs using the winston library, aiding in monitoring and debugging.
  • Resource Management: Concurrency control using p-limit to prevent resource exhaustion.
  • Graceful Shutdown: Handles process termination signals to close database connections properly.

Installation

Install VectorizeJS using npm:

npm install vectorizejs

Install the required peer dependencies:

npm install pg gpt-3-encoder winston p-limit

Requirements

  • Node.js: Version 12 or higher.
  • PostgreSQL: Version 12 or higher with the pgvector extension installed.
  • Database: A PostgreSQL database with appropriate tables set up.

Usage

1. Import the Module

const vectorize = require('vectorizejs');

2. Define Your Embedding Function

Implement an asynchronous function that generates embeddings from text. This function can call external APIs like OpenAI's embedding API.

const axios = require('axios');

// The URL below is a placeholder: replace it with your embedding service's endpoint.
async function embeddingFunction(text) {
  // Errors are left to propagate so that the retry mechanism can handle them.
  const response = await axios.post('https://api.example.com/embed', { text });
  return response.data.embedding; // Should be a vector/array of numbers
}

3. Configure and Run VectorizeJS

vectorize({
  connectionString: 'your_postgresql_connection_string',
  sourceTable: 'your_source_table',
  contentColumn: 'your_content_column',
  embeddingTable: 'your_embedding_table',
  embeddingDimensions: 1536, // Adjust based on your embedding model
  embeddingFunction: embeddingFunction,
  chunkSize: 800,            // Optional, default is 800 tokens
  chunkOverlap: 200,         // Optional, default is 200 tokens
  maxConcurrentChunks: 5,    // Optional, default is 5
  maxRetries: 3,             // Optional, default is 3
});

Configuration Options

  • connectionString (string, required): PostgreSQL connection string.

  • sourceTable (string, required): Name of the source table to monitor for changes.

  • contentColumn (string, required): Column in the source table that contains the text content.

  • embeddingTable (string, required): Name of the table where embeddings will be stored.

  • embeddingDimensions (number, required): Dimensionality of the embeddings produced by your embedding function.

  • embeddingFunction (function, required): Asynchronous function that takes a string and returns an embedding vector.

  • chunkSize (number, optional): Maximum number of tokens per chunk. Default is 800.

  • chunkOverlap (number, optional): Number of tokens to overlap between chunks. Default is 200.

  • maxConcurrentChunks (number, optional): Maximum number of chunks to process concurrently. Default is 5.

  • maxRetries (number, optional): Maximum number of retries for transient errors during embedding generation. Default is 3.
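To illustrate how chunkSize and chunkOverlap interact, here is a minimal sketch of overlapping token windows. It is not VectorizeJS's actual implementation: chunkTokens is a hypothetical helper, and the whitespace split only stands in for a real tokenizer. Each window starts chunkSize - chunkOverlap tokens after the previous one, so neighbouring chunks share chunkOverlap tokens.

```javascript
// Hypothetical helper (not part of VectorizeJS): split a token array
// into windows of up to chunkSize tokens that overlap by chunkOverlap.
function chunkTokens(tokens, chunkSize = 800, chunkOverlap = 200) {
  if (chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new RangeError('chunkOverlap must be >= 0 and smaller than chunkSize');
  }
  const step = chunkSize - chunkOverlap; // how far each window advances
  const chunks = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + chunkSize));
    if (start + chunkSize >= tokens.length) break; // this window reached the end
  }
  return chunks;
}

// Whitespace splitting is only a stand-in for a real tokenizer.
const sampleTokens = 'a b c d e f g h i j'.split(' ');
console.log(chunkTokens(sampleTokens, 4, 2).map((chunk) => chunk.join(' ')));
// prints [ 'a b c d', 'c d e f', 'e f g h', 'g h i j' ]
```

With the defaults (800 and 200), each chunk would begin 600 tokens after the previous one. A larger overlap preserves more context across chunk boundaries at the cost of more embedding calls.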

Embedding Function

Your embeddingFunction should be an asynchronous function that accepts a text string and returns a Promise resolving to an embedding vector (an array of numbers). Here's an example using OpenAI's API:

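// Uses the v3 interface of the openai package (Configuration / OpenAIApi).
const { Configuration, OpenAIApi } = require('openai');

const configuration = new Configuration({
  // Read the key from the environment instead of hard-coding it.
  // OPENAI_API_KEY is a conventional variable name, used here as an assumption.
  apiKey: process.env.OPENAI_API_KEY,
});

const openai = new OpenAIApi(configuration);

async function embeddingFunction(text) {
  // Errors are left to propagate so that the retry mechanism can catch them.
  const response = await openai.createEmbedding({
    input: text,
    model: 'text-embedding-ada-002',
  });
  return response.data.data[0].embedding;
}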

Note: Your embedding function should signal failure by throwing an error (or returning a rejected Promise), so that the retry mechanism can catch it and try again.
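Since embeddingDimensions must match what your function returns, it may be worth checking the vector before it reaches the database. The wrapper below is a sketch of one way to do that. withDimensionCheck and fakeEmbed are hypothetical names, not part of VectorizeJS.

```javascript
// Hypothetical wrapper (not part of VectorizeJS): checks that an embedding
// function resolves to an array of finite numbers of the expected length.
function withDimensionCheck(embeddingFunction, embeddingDimensions) {
  return async function checkedEmbeddingFunction(text) {
    const vector = await embeddingFunction(text);
    const isNumericArray =
      Array.isArray(vector) &&
      vector.every((x) => typeof x === 'number' && Number.isFinite(x));
    if (!isNumericArray) {
      throw new TypeError('embeddingFunction must resolve to an array of finite numbers');
    }
    if (vector.length !== embeddingDimensions) {
      throw new RangeError(
        `expected ${embeddingDimensions} dimensions, got ${vector.length}`
      );
    }
    return vector;
  };
}

// Usage with a fake three-dimensional embedder (illustrative only).
const fakeEmbed = async (text) => [text.length, 0, 1];
const checkedEmbed = withDimensionCheck(fakeEmbed, 3);
checkedEmbed('hello').then((vector) => console.log(vector)); // prints [ 5, 0, 1 ]
```

A dimension mismatch is not a transient error, so if thrown errors are retried, this check may cause a few wasted attempts before the failure surfaces. Whether that trade-off is acceptable depends on your setup.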

Logging

VectorizeJS uses the winston library for logging. By default, it logs to the console with timestamps and log levels. You can customize the logging behavior by modifying the logger configuration in the source code.

const winston = require('winston');

const logger = winston.createLogger({
  level: 'info', // Change to 'debug' for more verbose output
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.printf(
      ({ timestamp, level, message }) => `${timestamp} [${level}]: ${message}`
    )
  ),
  transports: [new winston.transports.Console()],
});

Error Handling and Retries

VectorizeJS includes robust error handling:

  • Retry Mechanism: Transient errors during the embedding process trigger retries with exponential backoff.

  • Maximum Retries: Configurable via the maxRetries option. Default is 3.

  • Logging Errors: All errors are logged with detailed messages to aid in debugging.
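The retry behaviour described above can be sketched as follows. This is an illustration of exponential backoff in general, not the module's own code: withRetries, sleep, and baseDelayMs are hypothetical names, and the sketch assumes maxRetries counts retries after the first attempt.

```javascript
// Hypothetical helpers (not part of VectorizeJS).
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Runs operation, retrying up to maxRetries times after the first attempt.
// The wait doubles after each failure: baseDelayMs, 2x, 4x, and so on.
async function withRetries(operation, maxRetries = 3, baseDelayMs = 500) {
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await operation();
    } catch (error) {
      if (attempt >= maxRetries) throw error; // retries exhausted
      await sleep(baseDelayMs * 2 ** attempt);
    }
  }
}

// Usage: an operation that fails twice, then succeeds.
let calls = 0;
const flaky = async () => {
  calls += 1;
  if (calls < 3) throw new Error('transient failure');
  return 'ok';
};
withRetries(flaky, 3, 10).then((result) => console.log(result, calls)); // prints ok 3
```

Doubling the delay gives a rate-limited or briefly unavailable embedding service progressively more time to recover before the next attempt.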

Concurrency Control

To prevent resource exhaustion, VectorizeJS limits the number of concurrent chunk processing operations:

  • Concurrency Limit: Configurable via the maxConcurrentChunks option. Default is 5.

  • Adjusting the Limit: Increase or decrease based on your system's capacity and the embedding service's rate limits.
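To show what a concurrency limit does in practice, here is a small hand-rolled limiter. It is a stand-in for illustration only, written without dependencies so it runs on its own. createLimiter and the names in the usage example are hypothetical, and the sketch does not reproduce how VectorizeJS wires up p-limit internally.

```javascript
// Hypothetical stand-in for a limiter such as p-limit: at most
// maxConcurrent tasks run at once, and the rest wait in a FIFO queue.
function createLimiter(maxConcurrent) {
  let active = 0;
  const queue = [];

  const next = () => {
    if (active >= maxConcurrent || queue.length === 0) return;
    active += 1;
    const { task, resolve, reject } = queue.shift();
    Promise.resolve()
      .then(task) // also captures synchronous throws from task
      .then(resolve, reject)
      .finally(() => {
        active -= 1;
        next(); // start the next queued task, if any
      });
  };

  return (task) =>
    new Promise((resolve, reject) => {
      queue.push({ task, resolve, reject });
      next();
    });
}

// Usage: five simulated chunks, never more than two in flight.
const limit = createLimiter(2);
let running = 0;
let peak = 0;
const processChunk = (i) =>
  limit(async () => {
    running += 1;
    peak = Math.max(peak, running);
    await new Promise((resolve) => setTimeout(resolve, 10)); // simulated work
    running -= 1;
    return i * 2;
  });

Promise.all([1, 2, 3, 4, 5].map(processChunk)).then((results) =>
  console.log(results, 'peak concurrency:', peak)
);
```

Results come back in input order even though tasks finish at different times, because Promise.all preserves the order of the promises it is given.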

Graceful Shutdown

VectorizeJS handles process termination signals to ensure a graceful shutdown:

process.on('SIGINT', async () => {
  logger.info('Shutting down VectorizeJS...');
  await client.end();
  process.exit();
});

This ensures that database connections are closed properly, preventing potential data corruption or connection leaks.

Contributing

Contributions are welcome! Please open an issue or submit a pull request on GitHub.

  1. Fork the repository.
  2. Create your feature branch (git checkout -b feature/YourFeature).
  3. Commit your changes (git commit -am 'Add some feature').
  4. Push to the branch (git push origin feature/YourFeature).
  5. Open a pull request.

License

This project is licensed under the MIT License - see the LICENSE file for details.


Disclaimer: Ensure that you comply with the terms and conditions of any third-party services (like OpenAI) that you use with this module. Handle API keys and sensitive information securely.