GPT Semantic Cache
An NPM package for semantic caching of GPT responses using Redis and Approximate Nearest Neighbors (ANN) search.
Table of Contents
- Introduction
- Features
- Installation
- Quick Start
- Usage
- Configuration Options
- Science Behind the Package
- Examples
- License
Introduction
The GPT Semantic Cache is a Node.js package that provides a semantic caching mechanism for GPT responses. By leveraging semantic embeddings and approximate nearest neighbors search, the package efficiently caches and retrieves GPT responses based on the semantic similarity of user queries. Because queries with similar meaning are served from the cache, redundant API calls to GPT models are avoided, cutting costs and improving response times for end users.
Here are several areas where this can be used:
- Technical Customer Support: Support queries are specific and grounded in technical documents, so semantic caching can answer similar questions from the cache.
- Product Support: Responses about online shopping products, where product specifications and common queries are largely static.
- Other support-based services.
Features
- Semantic Caching: Efficiently cache GPT responses based on semantic similarity.
- Supports Multiple Embedding Sources: Use OpenAI or local models for generating embeddings.
- Redis Integration: Utilize Redis for fast storage and retrieval of cached data.
- Approximate Nearest Neighbors (ANN) Search: Quickly find similar queries using ANN algorithms.
- Customizable Settings: Adjust similarity thresholds, cache TTL, and more according to your needs.
Installation
npm install gpt-semantic-cache
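Note: the package uses Redis for storage, so you will also need a reachable Redis instance (the examples below assume one at redis://localhost:6379).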
Quick Start
Here's a quick example to get you started:
const { SemanticGPTCache } = require('gpt-semantic-cache');

(async () => {
  const cache = new SemanticGPTCache({
    embeddingOptions: {
      type: 'openai',
      openAIApiKey: 'YOUR_OPENAI_API_KEY',
    },
    gptOptions: {
      openAIApiKey: 'YOUR_OPENAI_API_KEY',
      model: 'gpt-3.5-turbo',
    },
    cacheOptions: {
      redisUrl: 'redis://localhost:6379',
      similarityThreshold: 0.8,
      cacheTTL: 3600, // cache time-to-live in seconds
      embeddingSize: 1536, // OpenAI's embedding size
    },
  });

  await cache.initialize();

  const response = await cache.query('What is the capital of France?');
  console.log(response);
})();
Usage
Initialization
To initialize the SemanticGPTCache, you need to provide configuration options for embeddings, GPT model, and caching.
const cache = new SemanticGPTCache({
  embeddingOptions: {
    type: 'local', // 'openai' or 'local'
    modelName: 'sentence-transformers/all-MiniLM-L6-v2', // only for local models
    openAIApiKey: 'YOUR_OPENAI_API_KEY', // only for OpenAI embeddings
  },
  gptOptions: {
    openAIApiKey: 'YOUR_OPENAI_API_KEY',
    model: 'gpt-3.5-turbo', // GPT model to call on a cache miss
    promptPrefix: 'You are an AI assistant.',
  },
  cacheOptions: {
    redisUrl: 'redis://localhost:6379',
    similarityThreshold: 0.8, // cosine similarity threshold for cache hits
    cacheTTL: 3600, // time-to-live for cache entries in seconds
    embeddingSize: 384, // embedding size (384 for local models, 1536 for OpenAI)
  },
});
await cache.initialize();
Initialization Options Explained:

embeddingOptions:
- type: 'openai' or 'local'. Specifies the source of embeddings.
- modelName: The name of the local embedding model to use (e.g., 'sentence-transformers/all-MiniLM-L6-v2').
- openAIApiKey: Your OpenAI API key (required if type is 'openai').

gptOptions:
- openAIApiKey: Your OpenAI API key for accessing the GPT model.
- model: The GPT model to call on a cache miss (e.g., 'gpt-3.5-turbo').
- promptPrefix: An optional string to prepend to every prompt sent to the GPT model.

cacheOptions:
- redisUrl: The URL of your Redis instance (e.g., 'redis://localhost:6379').
- similarityThreshold: A number between 0 and 1 representing the cosine similarity threshold for cache hits.
- cacheTTL: The time-to-live for cache entries in seconds.
- embeddingSize: The dimensionality of the embeddings (e.g., 384 for local models, 1536 for OpenAI).
Querying
To query the cache and get a response:
const response = await cache.query('Your query here', 'Additional context if any');
console.log(response);
- If a similar query exists in the cache (based on the similarity threshold), the cached response is returned.
- If no similar query is found, the GPT API is called, and the response is cached for future queries.
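As an illustration of both cases above, a minimal sketch, assuming the cache instance from the Quick Start and run inside an async function (the second query's wording is made up, and whether it actually hits depends on your similarityThreshold):

const a = await cache.query('What is the capital of France?'); // miss: calls the GPT API, then caches
const b = await cache.query("What's France's capital city?");  // similar meaning: should be a cache hit
console.log(a === b); // true on a hit, since the cached response string is returned as-is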
Configuration Options
The package allows you to customize various settings to fit your needs:
- Similarity Threshold: Adjust the similarityThreshold in cacheOptions to control how similar a query needs to be to hit the cache. A higher threshold means only very similar queries will hit the cache.
- Cache Time-To-Live (TTL): Set cacheTTL to control how long entries remain in the cache.
- Embedding Size: Ensure embeddingSize matches the size of embeddings produced by your chosen embedding model.
Science Behind the Package
Semantic Embeddings
Semantic embeddings are vector representations of text that capture the meaning and context of the text. By converting both user queries and cached queries into embeddings, we can compare them in a high-dimensional space to find semantic similarities.
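For intuition, here is a standalone sketch of generating an embedding with the official openai Node.js SDK; the package handles this internally, and the model name text-embedding-ada-002 is an assumption based on the 1536-dimension size used elsewhere in this README:

const OpenAI = require('openai');

(async () => {
  const client = new OpenAI({ apiKey: 'YOUR_OPENAI_API_KEY' });
  // Turn text into a high-dimensional vector that captures its meaning.
  const res = await client.embeddings.create({
    model: 'text-embedding-ada-002', // assumed model; produces 1536-dim vectors
    input: 'What is the capital of France?',
  });
  const embedding = res.data[0].embedding; // number[] of length 1536
  console.log(embedding.length);
})();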
Approximate Nearest Neighbors Search
To efficiently find similar embeddings in the cache, the package uses the Hierarchical Navigable Small World (HNSW) algorithm for Approximate Nearest Neighbors search. HNSW constructs a graph of embeddings that allows for fast retrieval of nearest neighbors without comparing the query against every cached embedding.
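The package manages its ANN index internally; purely to illustrate how HNSW retrieval works, here is a standalone sketch using the hnswlib-node package (chosen for illustration, not a documented dependency of gpt-semantic-cache):

const { HierarchicalNSW } = require('hnswlib-node');

const dim = 4; // toy dimensionality; real embeddings are 384 or 1536
const index = new HierarchicalNSW('cosine', dim);
index.initIndex(100); // maximum number of elements the index can hold

// Insert cached query embeddings into the graph, keyed by integer labels.
index.addPoint([0.1, 0.2, 0.3, 0.4], 0);
index.addPoint([0.9, 0.1, 0.0, 0.2], 1);

// Find the nearest neighbor of a new query embedding without a full scan.
const { neighbors, distances } = index.searchKnn([0.11, 0.19, 0.31, 0.38], 1);
console.log(neighbors, distances); // e.g., [ 0 ] with a small cosine distance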
Cosine Similarity
Cosine similarity measures the cosine of the angle between two vectors in a multidimensional space. It is a commonly used metric to determine how similar two embeddings are. In this package, after retrieving the nearest neighbors using ANN search, cosine similarity is computed to ensure the retrieved embeddings meet the specified similarity threshold.
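Concretely, cosine similarity is the dot product of two vectors divided by the product of their magnitudes; a minimal sketch of the threshold check:

// cosine(a, b) = (a . b) / (||a|| * ||b||); 1 means identical direction.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// A cache hit requires the similarity to clear the configured threshold.
console.log(cosineSimilarity([1, 0], [0.9, 0.1]) >= 0.8); // true (~0.994)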
Caching Mechanism
The caching mechanism works as follows (see the sketch after this list):
1. Embedding Generation: When a query is received, it's converted into an embedding using the specified embedding model.
2. ANN Search: The embedding is used to search the ANN index for similar embeddings.
3. Similarity Check: Retrieved embeddings are compared using cosine similarity to ensure they meet the similarity threshold.
4. Cache Hit or Miss:
- Cache Hit: If a similar embedding is found, the associated response is retrieved from Redis and returned.
- Cache Miss: If no similar embedding is found, the query is sent to the GPT API. The response is then cached along with its embedding for future queries.
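Put together, a lookup roughly follows this shape. This is a conceptual sketch only: embed, annSearch, callGPT, and the Redis helpers are hypothetical stand-ins, not the package's actual internals, and cosineSimilarity is the function sketched above.

// Conceptual flow; every helper below is a hypothetical stand-in.
async function queryWithCache(userQuery, threshold = 0.8) {
  const embedding = await embed(userQuery);            // 1. embedding generation
  const nearest = await annSearch(embedding);          // 2. ANN (HNSW) search
  if (nearest && cosineSimilarity(embedding, nearest.embedding) >= threshold) {
    return redisGet(nearest.id);                       // 3-4. hit: serve cached response
  }
  const response = await callGPT(userQuery);           // 4. miss: call the GPT API
  await redisSet(embedding, response);                 //    cache embedding + response
  return response;
}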
Examples
Using a Local Embedding Model
const cache = new SemanticGPTCache({
  embeddingOptions: {
    type: 'local',
    modelName: 'sentence-transformers/all-MiniLM-L6-v2',
  },
  gptOptions: {
    openAIApiKey: 'YOUR_OPENAI_API_KEY',
    model: 'gpt-3.5-turbo',
  },
  cacheOptions: {
    redisUrl: 'redis://localhost:6379',
    similarityThreshold: 0.75,
    cacheTTL: 7200, // 2 hours
    embeddingSize: 384, // for the MiniLM model
  },
});
await cache.initialize();
const response = await cache.query('Tell me a joke.');
console.log(response);
Adjusting Similarity Threshold
You can adjust the similarityThreshold to control cache sensitivity:
// Higher threshold - only very similar queries will hit the cache
cache.cacheOptions.similarityThreshold = 0.9;
// Lower threshold - more queries will hit the cache, but responses may be less relevant
cache.cacheOptions.similarityThreshold = 0.6;
License
This project is licensed under the MIT License.