@memberjunction/ai-vector-dupe
v2.13.4
MemberJunction: AI Vector/Entity Sync Package - Handles synchronization between Vector DB and MJ CDP Data
AI Vector Dupe Documentation
AI Vector Dupe is a package designed to identify duplicate records in a database by generating vector representations and finding similar vectors. Users can then take actions, such as merging or deleting the detected duplicates.
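To make the idea of "finding similar vectors" concrete, the sketch below computes cosine similarity between small example vectors. It is illustrative only: the vectors are made up, `cosineSimilarity` is a hypothetical helper rather than part of this package, and in practice the similarity search is performed by the vector database, whose metric depends on how the index is configured.

```javascript
// Hypothetical helper, not part of @memberjunction/ai-vector-dupe.
// Cosine similarity: dot(a, b) / (|a| * |b|), ranging from -1 to 1.
function cosineSimilarity(a, b) {
  if (a.length !== b.length || a.length === 0) {
    throw new Error('Vectors must be non-empty and of equal length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    throw new Error('Cosine similarity is undefined for a zero vector');
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Made-up 3-dimensional "embeddings"; real embeddings have far more dimensions.
const recordA = [0.9, 0.1, 0.3];
const nearDuplicate = [0.88, 0.12, 0.31];
const unrelated = [-0.2, 0.95, 0.05];

console.log(cosineSimilarity(recordA, nearDuplicate)); // close to 1
console.log(cosineSimilarity(recordA, unrelated));     // much lower
```

A score near 1 suggests two records describe the same thing; how high the score must be before a pair counts as a duplicate is a tuning decision.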
Prerequisites
Before using the package, ensure the following requirements are met:
SQL Server with MemberJunction Framework
See the MemberJunction Documentation.
Embedding Model API Key
Supported embedding models include OpenAI, Mistral, and others supported by MemberJunction.
Vector Database API Key
Currently, only Pinecone is supported for vector storage.
How to Run the Package
Follow these steps to use the AI Vector Dupe package:
Load Required Packages
Ensure this package, along with your embedding and vector database packages, is loaded into your application. Verify they are not tree-shaken out.
Prepare Records
Create a list of records to search for duplicates.
Note: Currently, this package supports finding duplicates within the same entity. Support for cross-entity duplicate checks is planned for future updates.
Call the getDuplicateRecords Function
Create an instance of the DuplicateRecordDetector class and call the getDuplicateRecords function with the following parameters:

| Parameter | Type | Description |
|--------------------|-------------------|-----------------------------------------------------------------------------|
| listID | string | The ID of the list containing the records to analyze. |
| entityID | string | The ID of the entity the records belong to. |
| probabilityScore | number (optional) | The minimum similarity score to consider a record as a potential duplicate. |

Return: A Promise that resolves after processing. For large datasets, it is recommended not to await the result.
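Since the returned Promise may take a long time to resolve on large lists, one option is to start the run without awaiting it and attach a handler so failures are still reported. The sketch below uses a stand-in class, `StubDetector`, purely so the snippet runs on its own. It only mimics the call shape described in the table above and is not the real DuplicateRecordDetector; `startDuplicateRun` is likewise a hypothetical helper.

```javascript
// Stand-in for DuplicateRecordDetector so this sketch is self-contained.
// Assumption: the real method accepts the parameters from the table above
// and returns a Promise; its resolved value is not documented here.
class StubDetector {
  getDuplicateRecords(params) {
    return new Promise((resolve, reject) => {
      if (!params.listID || !params.entityID) {
        reject(new Error('listID and entityID are required'));
        return;
      }
      // Simulate long-running work.
      setTimeout(() => resolve({ status: 'done' }), 10);
    });
  }
}

// Hypothetical helper: start a run without blocking the caller.
function startDuplicateRun(detector, params, onError = console.error) {
  const run = detector.getDuplicateRecords(params);
  // Attach a rejection handler so a failed run does not surface as an
  // unhandled promise rejection on this promise.
  run.catch(onError);
  return run; // callers may still await or chain on it if they need to
}

const run = startDuplicateRun(new StubDetector(), {
  listID: 'example-list-id',
  entityID: 'example-entity-id',
  probabilityScore: 0.9,
});
console.log('Run started; continuing with other work...');
run.then((result) => console.log('Run finished:', result));
```

The design choice here is to separate starting the run from waiting on it: the caller returns immediately, while errors are still routed to a handler instead of being silently dropped.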
Workflow: getDuplicateRecords Function
The getDuplicateRecords function performs the following steps:
Fetch Records
Fetches the list by listID and retrieves all records contained within it.
Generate or Fetch Vectors
- If configured, generates new vectors for all records associated with the specified entityID and upserts them into the vector database.
- If not configured to upsert new vectors, it queries the vector database to fetch existing vectors for the records.
Search for Similar Vectors
For each vector, queries the vector database to find n similar vectors (where n is user-specified).
Fetch Related Records
Fetches database records corresponding to the similar vectors retrieved.
Merge Duplicates (Optional)
If configured, merges records marked as duplicates into the source record based on a similarity probability threshold.
- Example: If the similarity score exceeds 0.95, the record is merged.
Track Results
Records are created in the database to log:
- The duplicate record search run.
- Which records were analyzed.
- Which records were marked as potential duplicates.
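The search and filtering steps above can be sketched with in-memory stand-ins. This is not the package's implementation: the real flow talks to SQL Server, an embedding model, and the vector database, whereas here `records`, `cosine`, `findTopMatches`, and `detectDuplicates` are hypothetical, the vectors are hard-coded, and cosine similarity is assumed as the score. The sketch covers only the search for n similar vectors, the threshold filter, and a simple result log; it does not merge anything.

```javascript
// Hypothetical in-memory stand-in for records whose vectors already exist.
const records = [
  { id: 'r1', name: 'Acme Corp',        vector: [1, 0, 0] },
  { id: 'r2', name: 'ACME Corporation', vector: [0.99, 0.1, 0] },
  { id: 'r3', name: 'Globex',           vector: [0, 1, 0] },
];

// Assumed scoring function; the real score comes from the vector database.
function cosine(a, b) {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const norm = (v) => Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return dot / (norm(a) * norm(b));
}

// "Search for Similar Vectors": the n best matches for one record,
// excluding the record itself.
function findTopMatches(source, candidates, n) {
  return candidates
    .filter((c) => c.id !== source.id)
    .map((c) => ({ id: c.id, score: cosine(source.vector, c.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, n);
}

// Threshold filter plus a per-record log entry, loosely mirroring
// "Track Results". minScore plays the role of probabilityScore.
function detectDuplicates(list, n, minScore) {
  return list.map((source) => ({
    recordID: source.id,
    potentialDuplicates: findTopMatches(source, list, n)
      .filter((m) => m.score >= minScore),
  }));
}

const runLog = detectDuplicates(records, 2, 0.95);
console.log(JSON.stringify(runLog, null, 2));
```

With these sample vectors, r1 and r2 are flagged as potential duplicates of each other, and r3 has no matches above the threshold.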
Example Usage
Here is an example of how to use the package:
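const { DuplicateRecordDetector } = require('@memberjunction/ai-vector-dupe');

// Create an instance of the DuplicateRecordDetector
const detector = new DuplicateRecordDetector();

// Call getDuplicateRecords. The returned Promise is intentionally not awaited;
// attach a handler so failures are still reported.
detector.getDuplicateRecords({
  listID: 'example-list-id',
  entityID: 'example-entity-id',
  probabilityScore: 0.9
}).catch((err) => console.error('Duplicate detection failed:', err));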
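Before submitting a request like the one above, it can help to validate it locally. The helper below is a hypothetical sketch, not part of the package. It assumes that listID and entityID must be non-empty strings and that probabilityScore, when given, is a number between 0 and 1, which matches the example values but is not stated explicitly in this README.

```javascript
// Hypothetical validator; the rules below are assumptions based on the
// parameter table, not behavior confirmed by the package.
function validateDuplicateRequest(params) {
  const errors = [];
  for (const key of ['listID', 'entityID']) {
    if (typeof params[key] !== 'string' || params[key].trim() === '') {
      errors.push(`${key} must be a non-empty string`);
    }
  }
  const score = params.probabilityScore;
  if (score !== undefined) {
    if (typeof score !== 'number' || Number.isNaN(score) || score < 0 || score > 1) {
      errors.push('probabilityScore must be a number between 0 and 1');
    }
  }
  return errors; // an empty array means the request looks well-formed
}

console.log(validateDuplicateRequest({
  listID: 'example-list-id',
  entityID: 'example-entity-id',
  probabilityScore: 0.9,
}));

console.log(validateDuplicateRequest({
  listID: '',
  entityID: 'example-entity-id',
  probabilityScore: 1.5,
}));
```

Returning a list of messages rather than throwing on the first problem lets a caller report every issue with a request at once.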