@deonis/hive

v1.2.2

![Hive](./img/hive.png)

Hive (Work in progress)

Hive is a lightweight document database and vector search engine built with Node.js. It provides an efficient way to store, retrieve, and search documents using vector embeddings.

Features

Document storage and retrieval
Vector-based similarity search
Automatic document processing and tokenization
Support for various file formats (txt, doc, docx, pdf, png, jpg, jpeg) via any-text and transformers.js
In-memory and on-disk persistence

Installation

npm i @deonis/hive 

or

git clone https://github.com/dspasyuk/hive

Usage

import Hive from '@deonis/hive';  
await Hive.init({
  dbName: "Documents",
  pathToDB: "/path/to/my/database.json",
  pathToDocs: "/path/to/documents",
  documents: {
    text: [".txt", ".doc", ".docx", ".pdf"]
  },
  minSliceSize: 200,
});

Hive first takes all the text files from the ./docs folder, splits them into chunks (512 tokens), converts the chunks to vectors, and creates a vector database at './db/Documents/db.json'.

Once the database is created, you can run a vector search as follows:
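
The chunking step can be pictured with a small sketch. This is not Hive's actual implementation: it approximates tokens with whitespace-separated words, and the function name sliceText, as well as the rule of merging a short trailing slice based on minSliceSize, are assumptions made for illustration only.

```javascript
// Illustrative sketch only: "tokens" are approximated by whitespace-separated words.
// sliceText and the way minSliceSize is applied are assumptions, not Hive internals.
function sliceText(text, sliceSize = 512, minSliceSize = 200) {
  const words = text.split(/\s+/).filter(Boolean);
  const slices = [];
  for (let i = 0; i < words.length; i += sliceSize) {
    slices.push(words.slice(i, i + sliceSize));
  }
  // Assumed rule: a trailing slice shorter than minSliceSize is merged
  // into the previous slice instead of being stored on its own.
  if (slices.length > 1 && slices[slices.length - 1].length < minSliceSize) {
    const tail = slices.pop();
    slices[slices.length - 1] = slices[slices.length - 1].concat(tail);
  }
  return slices.map((slice) => slice.join(" "));
}

// 1100 words with the default sizes: the short remainder joins the second slice.
const sample = Array.from({ length: 1100 }, (_, i) => `w${i}`).join(" ");
const chunks = sliceText(sample);
console.log(chunks.map((chunk) => chunk.split(" ").length));
```

Each resulting chunk would then be embedded separately, so one long document contributes several vectors to the database.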

const vector = await Hive.getVector("Hello World", Hive.TransOptions); // uses transformers.js to generate a vector
const results = await Hive.find(vector.data, 10); // 10 is topK, the number of results to return
console.log(results);

Options Description

Hive.sliceSize = 512;  
Hive.models = {text: "Xenova/all-MiniLM-L6-v2", image:"Xenova/clip-vit-base-patch16"};  
Hive.documents = {text: [".txt", ".doc", ".docx", ".pdf"], image:[".png", ".jpg", ".jpeg"]};  
Hive.TransOptions = { pooling: "mean", normalize: false };  
Hive.escapeRules={"&":"&amp;","<":"&lt;",">":"&gt;",'"':"&quot;","'":"&#39;","\\":"\\\\","/":"\\/"};  

Hive.sliceSize (number): Defines the size of text slices for large documents during vector generation. The default value is 512.  

Hive.models (object): Specifies the models used for generating embeddings.  
  text (string): The model for text embeddings, e.g., "Xenova/all-MiniLM-L6-v2".  
  image (string): The model for image embeddings, e.g., "Xenova/clip-vit-base-patch16".  

Hive.documents (object): Defines the types of documents that can be processed for embeddings.  
  text (array): List of supported text file extensions, e.g., [".txt", ".doc", ".docx", ".pdf"].  
  image (array): List of supported image file extensions, e.g., [".png", ".jpg", ".jpeg"].  

Hive.TransOptions (object): Configures options for transformers.  
  pooling (string): Specifies the pooling method, e.g., "mean".  
  normalize (boolean): Determines if vectors should be normalized. Default is false.  
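
To make the pooling and normalize options more concrete, here is a minimal sketch of what mean pooling and L2 normalization do to a set of per-token vectors. The function names meanPool and l2Normalize are hypothetical, and the sketch ignores padding and attention masks; transformers.js carries out the real computation internally when these options are passed.

```javascript
// Mean pooling: average the per-token vectors into one fixed-size vector.
function meanPool(tokenVectors) {
  const dim = tokenVectors[0].length;
  const pooled = new Array(dim).fill(0);
  for (const vec of tokenVectors) {
    for (let i = 0; i < dim; i++) {
      pooled[i] += vec[i] / tokenVectors.length;
    }
  }
  return pooled;
}

// L2 normalization: scale a vector to unit length (the effect of normalize: true).
function l2Normalize(vector) {
  const norm = Math.hypot(...vector);
  return norm === 0 ? vector.slice() : vector.map((x) => x / norm);
}

const pooledVector = meanPool([[1, 2], [5, 6]]);
const unitVector = l2Normalize(pooledVector);
console.log(pooledVector, unitVector);
```

With normalize left at false, vectors keep their original length; cosine similarity is unaffected by that choice because it divides by the vector norms anyway.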

Hive.escapeRules (object): Defines escape sequences for special characters during text parsing.  
  &: &amp;  
  <: &lt;  
  >: &gt;  
  ": &quot;  
  ': &#39;  
  \: \\  
  /: \/  
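
As an illustration of how a rule table like this can be applied, the sketch below escapes a string in a single pass. The helper name escapeText is hypothetical and this is not necessarily how Hive applies the rules internally.

```javascript
// Same mapping as Hive.escapeRules above, written out so the snippet is self-contained.
const escapeRules = {
  "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;",
  "'": "&#39;", "\\": "\\\\", "/": "\\/",
};

// A single pass over the input avoids escaping the "&" of an entity
// that was produced by an earlier replacement.
function escapeText(text, rules = escapeRules) {
  return text.replace(/[&<>"'\\\/]/g, (ch) => rules[ch] ?? ch);
}

console.log(escapeText('<a href="x/y">Tom & Jerry</a>'));
```

Replacing characters one rule at a time would also work, but only if the & rule runs first; the single-pass form does not depend on rule order.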

These options allow you to customize how Hive processes and stores text and image embeddings. Adjust these settings based on your specific requirements and the types of documents you plan to work with.

Alternative approach to creating a database

By default, Hive uses transformers.js (Xenova/all-MiniLM-L6-v2) to create vector embeddings for your data and queries, but you can easily bypass that with the insertOne function:

Hive.insertOne({vector: Array,  meta: {}});
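
A sketch of preparing such an entry with your own embeddings might look like the following. Here toyEmbed is a hypothetical stand-in for whatever embedding model you use, and the meta field names (filePath, text) are assumptions for illustration. The Hive call itself is left as a comment so the snippet runs without the package installed.

```javascript
// Hypothetical stand-in for your own embedding model: counts the letters a-z.
function toyEmbed(text) {
  const vector = new Array(26).fill(0);
  for (const ch of text.toLowerCase()) {
    const idx = ch.charCodeAt(0) - 97;
    if (idx >= 0 && idx < 26) vector[idx] += 1;
  }
  return vector;
}

// Builds an object in the {vector, meta} shape that insertOne takes.
// The meta field names used here are assumptions, not a documented schema.
function toEntry(text, filePath, embed = toyEmbed) {
  return { vector: embed(text), meta: { filePath, text } };
}

const customEntry = toEntry("Hello World", "/path/to/doc.txt");
console.log(customEntry.vector.length, customEntry.meta.filePath);
// With the package installed, the entry could then be stored:
// Hive.insertOne(customEntry);
```

Whatever model produces the vectors, queries passed to Hive.find should be embedded with the same model so that the similarity scores are meaningful.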

Speed

The database is optimized to handle well over 1,000,000 entries of 512 tokens each. On average, searching a database of 30,000 entries takes around 30 ms (AMD Ryzen 7 3700X 8-Core Processor).

Insert Data

Hive.insertOne(entry)

entry: An object containing the vector and metadata for the document. {vector:[], meta:{}}

Inserts a single entry into the collection. The entry includes the vector (features extracted from text) and associated metadata.

Update Data

Hive.updateOne(query, entry)

query: An object identifying the document by its file path, e.g. {filePath:"/path/to/doc.txt"}
entry: An object containing the new vector and metadata for the document. {vector:[], meta:{}}

Updates the single entry that matches the query with the given vector (features extracted from text) and associated metadata.

Insert Array of Data

Hive.insertMany(entries)

entries: An array of objects, where each object contains a vector and meta.

Inserts multiple entries into the collection at once. Automatically saves the database to disk after bulk insertion.

Query Data

Hive.find(queryVector, topK = 5)

queryVector: The vector to compare against.
topK: Number of top similar results to return (default: 5).

Finds the top K vectors similar to the queryVector based on cosine similarity. Returns the most similar documents from the collection.
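
For readers unfamiliar with the technique, a brute-force cosine similarity search can be sketched as below. This is a generic illustration, not Hive's optimized implementation: the function names, the score field, and the shape of the returned results are assumptions.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Brute-force top-K: score every entry, sort by descending score, keep the first K.
function findTopK(entries, queryVector, topK = 5) {
  return entries
    .map((entry) => ({
      meta: entry.meta,
      score: cosineSimilarity(entry.vector, queryVector),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

const demoEntries = [
  { vector: [1, 0], meta: { id: "a" } },
  { vector: [0, 1], meta: { id: "b" } },
  { vector: [1, 1], meta: { id: "c" } },
];
const topResults = findTopK(demoEntries, [1, 0.1], 2);
console.log(topResults.map((r) => r.meta.id));
```

The cost of this approach grows linearly with the number of entries, which is why per-query timings are usually quoted together with the database size.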

License

This project is licensed under the MIT License.