pgml

v1.1.1

Published

8 months ago

Open Source Alternative for Building End-to-End Vector Search Applications without OpenAI & Pinecone


shyper

hyperparam

postgres machine learning vector databases embeddings

Overview

The JavaScript SDK is designed to facilitate the development of scalable vector search applications on PostgreSQL databases. With this SDK, you can manage the database tables related to documents, text chunks, text splitters, large language models (LLMs), and embeddings. The SDK indexes LLM embeddings with PgVector so that similarity queries stay fast and accurate.

Documentation: PostgresML SDK Docs

Examples Folder: Examples

Key Features

Automated Database Management: With the SDK, you can easily handle the management of database tables related to documents, text chunks, text splitters, LLM models, and embeddings. This automated management system simplifies the process of setting up and maintaining your vector search application's data structure.
Embedding Generation from Open Source Models: The JavaScript SDK provides the ability to generate embeddings using hundreds of open source models. These models, trained on vast amounts of data, capture the semantic meaning of text and enable powerful analysis and search capabilities.
Flexible and Scalable Vector Search: The JavaScript SDK integrates with PgVector, a PostgreSQL extension designed for vector-based indexing and querying. By leveraging these indices, you can perform advanced searches, rank results by relevance, and retrieve accurate and meaningful information from your database.

Use Cases

Embeddings, the core concept of the JavaScript SDK, find applications in various scenarios, including:

Search: Embeddings are commonly used for search functionalities, where results are ranked by relevance to a query string. By comparing the embeddings of query strings and documents, you can retrieve search results in order of their similarity or relevance.
Clustering: With embeddings, you can group text strings by similarity, enabling clustering of related data. By measuring the similarity between embeddings, you can identify clusters or groups of text strings that share common characteristics.
Recommendations: Embeddings play a crucial role in recommendation systems. By identifying items with related text strings based on their embeddings, you can provide personalized recommendations to users.
Anomaly Detection: Anomaly detection involves identifying outliers or anomalies that have little relatedness to the rest of the data. Embeddings can aid in this process by quantifying the similarity between text strings and flagging outliers.
Classification: Embeddings are utilized in classification tasks, where text strings are classified based on their most similar label. By comparing the embeddings of text strings and labels, you can classify new text strings into predefined categories.
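Several of the use cases above reduce to the same operation: score items by how close their embeddings are to a reference embedding. The following is a minimal sketch of that idea in plain JavaScript, not the SDK's implementation. The three-dimensional vectors and the function names (cosineSimilarity, rankBySimilarity, classify) are made up for illustration; real embeddings come from a model and have many more dimensions.

```javascript
// Toy 3-dimensional embeddings, made up for illustration only.
const documents = [
  { id: "cats", embedding: [0.9, 0.1, 0.0] },
  { id: "dogs", embedding: [0.7, 0.3, 0.1] },
  { id: "stocks", embedding: [0.0, 0.2, 0.9] },
];

// Cosine similarity: dot product divided by the product of the vector norms.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Search: score every item against the query and sort, most similar first.
function rankBySimilarity(queryEmbedding, items) {
  return items
    .map((item) => ({
      id: item.id,
      similarity: cosineSimilarity(queryEmbedding, item.embedding),
    }))
    .sort((x, y) => y.similarity - x.similarity);
}

// Classification: the predicted label is the top-ranked candidate.
function classify(textEmbedding, labels) {
  return rankBySimilarity(textEmbedding, labels)[0].id;
}

const query = [1.0, 0.0, 0.0]; // hypothetical embedding of a pet-related query
const ranked = rankBySimilarity(query, documents);
console.log(ranked.map((r) => r.id)); // [ 'cats', 'dogs', 'stocks' ]
```

Clustering and anomaly detection build on the same similarity score: items with high mutual similarity form a group, and items with low similarity to everything else are candidate outliers.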

How the JavaScript SDK Works

The JavaScript SDK streamlines the development of vector search applications by abstracting away the complexities of database management and indexing. Here's an overview of how the SDK works:

Automatic Document and Text Chunk Management: The SDK provides a convenient interface to manage documents and pipelines, automatically handling chunking and embedding for you. You can easily organize and structure your text data within the PostgreSQL database.
Open Source Model Integration: With the SDK, you can seamlessly incorporate a wide range of open source models to generate high-quality embeddings. These models capture the semantic meaning of text and enable powerful analysis and search capabilities.
Embedding Indexing: The JavaScript SDK utilizes the PgVector extension to efficiently index the embeddings generated by the open source models. This indexing process optimizes search performance and allows for fast and accurate retrieval of relevant results.
Querying and Search: Once the embeddings are indexed, you can perform vector-based searches on the documents and text chunks stored in the PostgreSQL database. The SDK provides intuitive methods for executing queries and retrieving search results.
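To make the four stages above concrete, here is a rough in-memory sketch of the same flow: split, embed, index, query. It is only an analogy under loud assumptions. splitIntoChunks, toyEmbed, buildIndex, and search are hypothetical helpers, the "embedding" is a plain word-count map rather than a language model, and the "index" is an array rather than a PgVector index. The SDK performs these steps inside PostgreSQL.

```javascript
// Split text into chunks of at most `size` words (a stand-in for a real splitter).
function splitIntoChunks(text, size) {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  for (let i = 0; i < words.length; i += size) {
    chunks.push(words.slice(i, i + size).join(" "));
  }
  return chunks;
}

// Toy embedding: a sparse word-count vector stored as a Map.
// This is NOT a language model; it only keeps the example self-contained.
function toyEmbed(text) {
  const counts = new Map();
  for (const word of text.toLowerCase().match(/[a-z0-9]+/g) || []) {
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  return counts;
}

// Cosine similarity between two sparse word-count vectors.
function cosine(a, b) {
  let dot = 0;
  for (const [word, count] of a) dot += count * (b.get(word) || 0);
  const norm = (m) => Math.sqrt([...m.values()].reduce((s, x) => s + x * x, 0));
  const denom = norm(a) * norm(b);
  return denom === 0 ? 0 : dot / denom;
}

// "Index": chunk each document and store every chunk with its embedding.
function buildIndex(docs, chunkSize) {
  const index = [];
  for (const doc of docs) {
    for (const chunk of splitIntoChunks(doc.text, chunkSize)) {
      index.push({ id: doc.id, text: chunk, embedding: toyEmbed(chunk) });
    }
  }
  return index;
}

// Query: embed the query the same way and return the `limit` closest chunks.
function search(index, query, limit) {
  const q = toyEmbed(query);
  return index
    .map((e) => ({ id: e.id, text: e.text, similarity: cosine(q, e.embedding) }))
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, limit);
}

const index = buildIndex(
  [
    { id: "a", text: "postgres stores vectors in tables" },
    { id: "b", text: "bananas are yellow fruit" },
  ],
  3
);
console.log(search(index, "yellow bananas", 1));
```

The linear scan in search is what a vector index replaces: instead of scoring every stored chunk, an index narrows the candidates before scoring.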

Quickstart

Follow the steps below to quickly get started with the JavaScript SDK for building scalable vector search applications on PostgresML databases.

Prerequisites

Before you begin, make sure you have the following:

PostgresML Database: Ensure you have a PostgresML database version >=2.7.7. You can spin up a database using Docker or sign up for a free GPU-powered database.
Set the DATABASE_URL environment variable to the connection string of your PostgresML database.
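Before running the sample, it can help to confirm that DATABASE_URL is set and well formed. The sketch below is an optional check, not part of the SDK. readDatabaseUrl is a hypothetical helper, the connection string is a made-up example, and the check assumes the usual postgres:// or postgresql:// URI form.

```javascript
// Read DATABASE_URL from an environment object and fail early if it is unusable.
function readDatabaseUrl(env = process.env) {
  const value = env.DATABASE_URL;
  if (!value) {
    throw new Error("DATABASE_URL is not set");
  }
  const parsed = new URL(value); // throws a TypeError on malformed input
  if (parsed.protocol !== "postgres:" && parsed.protocol !== "postgresql:") {
    throw new Error(`Unexpected protocol in DATABASE_URL: ${parsed.protocol}`);
  }
  return {
    host: parsed.hostname,
    port: parsed.port,
    database: parsed.pathname.slice(1), // drop the leading "/"
  };
}

// Example with a made-up connection string (not a real server).
console.log(
  readDatabaseUrl({
    DATABASE_URL: "postgres://user:secret@localhost:5433/pgml_development",
  })
);
```

Passing the environment as a parameter keeps the helper easy to test; calling readDatabaseUrl() with no argument reads process.env.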

Installation

To install the JavaScript SDK, use npm:

npm i pgml

Sample Code

Once you have the JavaScript SDK installed, you can use the following sample code as a starting point for your vector search application:

const pgml = require("pgml");

const main = async () => {
    const collection = pgml.newCollection("my_javascript_collection");

Explanation:

This code imports pgml and creates an instance of the Collection class, to which we will add pipelines and documents.

Continuing within const main:

    const model = pgml.newModel();
    const splitter = pgml.newSplitter();
    const pipeline = pgml.newPipeline("my_javascript_pipeline", model, splitter);
    await collection.add_pipeline(pipeline);

Explanation:

The code creates instances of Model and Splitter using their default arguments.
It then constructs a pipeline called "my_javascript_pipeline" and adds it to the collection initialized above. This pipeline automatically generates chunks and embeddings for every upserted document.

Continuing within const main:

    const documents = [
        {
          id: "Document One",
          text: "document one contents...",
        },
        {
          id: "Document Two",
          text: "document two contents...",
        },
    ];
    await collection.upsert_documents(documents);

Explanation:

This code creates and upserts some filler documents.
As mentioned above, the pipeline added earlier automatically runs and generates chunks and embeddings for each document.

Continuing within const main:

    const queryResults = await collection
        .query()
        .vector_recall("Some user query that will match document one first", pipeline)
        .limit(2)
        .fetch_all();

    // Convert the results to an array of objects
    const results = queryResults.map((result) => {
      const [similarity, text, metadata] = result;
      return {
        similarity,
        text,
        metadata,
      };
    });
    console.log(results);

    await collection.archive();
};

Explanation:

The query method is called to perform a vector-based search on the collection. The query string is "Some user query that will match document one first", and the top 2 results are requested.
The search results are converted to objects and printed.
Finally, the archive method is called to archive the collection and free up resources in the PostgresML database.

Finally, call the main function:

main().then(() => {
  console.log("Done with PostgresML demo");
});

Running the Code

Save the sample code in a file named vector_search.js, then open a terminal or command prompt and navigate to the directory where the file is saved.

Execute the following command:

node vector_search.js

You should see the search results printed in the terminal. As you can see, our vector search engine did match document one first.

[
  {
    similarity: 0.8506832955692104,
    text: 'document one contents...',
    metadata: { id: 'Document One' }
  },
  {
    similarity: 0.8066114609244565,
    text: 'document two contents...',
    metadata: { id: 'Document Two' }
  }
]
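Once results are plain objects, they can be post-processed with ordinary array methods. As a small sketch, the helper below keeps only results at or above a minimum similarity. idsAbove and the 0.83 cutoff are hypothetical choices for illustration, and the input simply repeats the output printed above.

```javascript
// Results in the same shape as the printed output.
const results = [
  {
    similarity: 0.8506832955692104,
    text: "document one contents...",
    metadata: { id: "Document One" },
  },
  {
    similarity: 0.8066114609244565,
    text: "document two contents...",
    metadata: { id: "Document Two" },
  },
];

// Keep only results at or above a minimum similarity and return their ids.
function idsAbove(items, minSimilarity) {
  return items
    .filter((r) => r.similarity >= minSimilarity)
    .map((r) => r.metadata.id);
}

console.log(idsAbove(results, 0.83)); // [ 'Document One' ]
```

A sensible cutoff depends on the model and the data, so treat any fixed threshold as something to tune rather than a recommended value.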

Upgrading

Changes between SDK versions are not necessarily backwards compatible. We provide a migrate function to help transition smoothly.

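const pgml = require("pgml");
// CommonJS modules have no top-level await, so chain on the returned promise.
pgml.migrate().then(() => console.log("Migration complete"));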

Developer Setup

This JavaScript library is generated from our core Rust SDK. Please check the rust-sdk documentation for developer setup.

Roadmap

[x] Enable filters on document metadata in vector_search. Issue
[x] text_search functionality on documents using Postgres text search. Issue
[x] hybrid_search functionality that does a combination of vector_search and text_search. Issue
[x] Ability to call and manage OpenAI embeddings for comparison purposes. Issue
[x] Perform chunking on the DB with multiple langchain splitters. Issue
[ ] Save vector_search history for downstream monitoring of model performance. Issue
