@deonis/hive
v1.2.2
Published
![Hive](./img/hive.png)
Downloads
228
Readme
Hive (Work in progress)
Hive is a lightweight document database and vector search engine built with Node.js. It provides an efficient way to store, retrieve, and search documents using vector embeddings.
Features
Document storage and retrieval
Vector-based similarity search
Automatic document processing and tokenization
Support for various file formats (txt, doc, docx, pdf, png, jpg, jpeg) (any-text, transformers.js)
In-memory and on-disk persistence
Installation
npm i @deonis/hive
or
git clone https://github.com/dspasyuk/hive
Usage
import Hive from '@deonis/hive';
await Hive.init({
dbName: "Documents",
pathToDB: "/path/to/my/database.json",
pathToDocs: "/path/to/documents",
documents: {
text: [".txt", ".doc", ".docx", ".pdf"]
},
minSliceSize: 200,
});
First the Hive will take all your text files from ./docs folder, split it in to chunks (512 tokens), convert them vectors and create vector database in './db/Documents/db.json'
Once database is created you can do a vector search using the following:
const vector = await Hive.getVector("Hello World", Hive.TransOptions); // uses transformer.js to generate a vector
const results = await Hive.find(vector.data, 10); // 10 is topK, number of searches to return
console.log(results);
Options Description
Hive.sliceSize = 512;
Hive.models = {text: "Xenova/all-MiniLM-L6-v2", image:"Xenova/clip-vit-base-patch16"};
Hive.documents = {text: [".txt", ".doc", ".docx", ".pdf"], image:[".png", ".jpg", ".jpeg"]};
Hive.TransOptions = { pooling: "mean", normalize: false };
Hive.escapeRules={"&":"&","<":"<",">":">",'"':""","'":"'","\\":"\\\\","/":"\\/"};
Hive.sliceSize (number): Defines the size of text slices for large documents during vector generation. The default value is 512.
Hive.models (object): Specifies the models used for generating embeddings.
text (string): The model for text embeddings, e.g., "Xenova/all-MiniLM-L6-v2".
image (string): The model for image embeddings, e.g., "Xenova/clip-vit-base-patch16".
Hive.documents (object): Defines the types of documents that can be processed for embeddings.
text (array): List of supported text file extensions, e.g., [".txt", ".doc", ".docx", ".pdf"].
image (array): List of supported image file extensions, e.g., [".png", ".jpg", ".jpeg"].
Hive.TransOptions (object): Configures options for transformers.
pooling (string): Specifies the pooling method, e.g., "mean".
normalize (boolean): Determines if vectors should be normalized. Default is false.
Hive.escapeRules (object): Defines escape sequences for special characters during text parsing.
&: &
<: <
>: >
": "
': '
\: \\
/: \/
These options allow you to customize how Hive processes and stores text and image embeddings. Adjust these settings based on your specific requirements and the types of documents you plan to work with.
Alternative aproach of creating a database
By default Hive user transformers.js (Xenova/all-MiniLM-L6-v2) to create vector embeddings for your data and query but you can easily bypass that using inserOne function
Hive.insertOne({vector: Array, meta: {}});
Speed
The database was optimized to handle well above 1,000,000 entries (512 tokens) large. On Average it takes around 30 ms to search a database with 30,000 entries (AMD Ryzen 7 3700X 8-Core Processor)
Insert Data
Hive.insertOne(entry)
entry: An object containing the vector and metadata for the document. {vector:[], meta:{}}
Inserts a single entry into the collection. The entry includes the vector (features extracted from text) and associated metadata.
Update Data
Hive.updateOne(query, entry)
entry: An object containing the vector and metadata for the document. {vector:[], meta:{}}
query: An object containing filePath {filePath:"/path/to/doc.txt"}
Inserts a single entry into the collection. The entry includes the vector (features extracted from text) and associated metadata.
Insert Array of Data
Hive.insertMany(entries)
entries: An array of objects, where each object contains a vector and meta.
Inserts multiple entries into the collection at once. Automatically saves the database to disk after bulk insertion.
Query Data
Hive.find(queryVector, topK = 5)
queryVector: The vector to compare against.
topK: Number of top similar results to return (default: 5).
Finds the top K vectors similar to the queryVector based on cosine similarity. Returns the most similar documents from the collection.
License MIT This project is licensed under the MIT License.