
docs2vector

v1.0.0

A tool to process markdown files from GitHub repositories and store them in Upstash Vector


GitHub Docs Vectorizer

A Node.js tool that processes Markdown files from GitHub repositories, generates embeddings, and stores them in an Upstash Vector database. It is suited to building document search systems, AI-driven documentation assistants, or knowledge bases.

Features

  • Clones any GitHub repository

  • Recursively finds all Markdown (.md) and MDX (.mdx) files

  • Chunks documents with LangChain's RecursiveCharacterTextSplitter

  • Supports both OpenAI and Upstash embeddings

  • Stores document chunks and metadata in Upstash Vector for retrieval

  • Cleans up temporary files automatically

  • Preserves file metadata for better context during retrieval

Prerequisites

  • Node.js (v16 or higher) installed on your machine
  • npm or Yarn for package management
  • GitHub personal access token (required for repository access)
  • Upstash Vector database account (to store vectors)
  • OpenAI API key (optional, for generating embeddings)

How to Create Your GitHub Token

  1. Go to GitHub.com and sign in to your account
  2. Click on your profile picture in the top-right corner
  3. Go to Settings > Developer settings > Personal access tokens > Tokens (classic)
  4. Click Generate new token > Generate new token (classic)
  5. Give your token a descriptive name in the "Note" field
  6. Select the following scopes:
    • repo (Full control of private repositories)
    • read:org (Read organization data)
  7. Click Generate token
  8. Important: Copy the token immediately and store it securely. You won't be able to see it again!

Note: If you're only accessing public repositories, you can create a token with just the public_repo scope instead of the full repo scope.

For security best practices:

  • Never commit your token to version control
  • Use environment variables or secure secret management
  • Set an expiration date for your token
  • Only grant the minimum required permissions

Installation Guide

  1. Clone the repository or create a new directory:

mkdir github-docs-vectorizer
cd github-docs-vectorizer

  2. Ensure the following files are included in your directory:

    • script.js: The main script for processing
    • package.json: Manages project dependencies
    • .env: Contains your environment variables (explained below)

  3. Install dependencies:

npm install

  4. Set up a .env file in the root directory of your project with your credentials:

# Required for accessing GitHub repositories
GITHUB_TOKEN=your_github_token

# Required for storing vectors in Upstash
UPSTASH_VECTOR_REST_URL=your_upstash_vector_url
UPSTASH_VECTOR_REST_TOKEN=your_upstash_vector_token

# Optional: Provide if using OpenAI embeddings
OPENAI_API_KEY=your_openai_api_key
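
A missing variable is easier to diagnose if it is caught before any cloning starts. The sketch below shows one way to check the variables listed above. It assumes they are already present in process.env (for example, loaded by a package such as dotenv, which this README does not name). The helper name loadConfig and the returned field names are hypothetical and are not taken from script.js.

```javascript
// Hypothetical helper: validates the variables named in the .env example above.
const REQUIRED_VARS = [
  'GITHUB_TOKEN',
  'UPSTASH_VECTOR_REST_URL',
  'UPSTASH_VECTOR_REST_TOKEN',
];

function loadConfig(env = process.env) {
  // Collect every missing name so the user sees all problems at once.
  const missing = REQUIRED_VARS.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(
      `Missing required environment variables: ${missing.join(', ')}`
    );
  }
  return {
    githubToken: env.GITHUB_TOKEN,
    upstashUrl: env.UPSTASH_VECTOR_REST_URL,
    upstashToken: env.UPSTASH_VECTOR_REST_TOKEN,
    // Optional: null means no OpenAI key was provided.
    openaiApiKey: env.OPENAI_API_KEY || null,
  };
}
```

Passing the environment as an argument keeps the check easy to exercise with a plain object instead of the real process environment.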

Usage

Run the script by providing the GitHub repository URL as an argument:

node script.js https://github.com/username/repository

Example:

node script.js https://github.com/facebook/react
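
Because the repository URL is the script's only argument, it is worth validating before anything is cloned. The following is a minimal sketch of such a check, not the parsing that script.js actually performs. The function name parseGitHubUrl and its return shape are hypothetical.

```javascript
// Hypothetical helper: extracts owner and repository name from a GitHub URL.
function parseGitHubUrl(repoUrl) {
  let url;
  try {
    url = new URL(repoUrl); // throws a TypeError on malformed input
  } catch {
    throw new Error(`Invalid repository URL: ${repoUrl}`);
  }
  // "/facebook/react" -> ["facebook", "react"]
  const [owner, repo] = url.pathname.split('/').filter(Boolean);
  if (url.hostname !== 'github.com' || !owner || !repo) {
    throw new Error(`Invalid repository URL: ${repoUrl}`);
  }
  // Accept clone-style URLs that end in ".git".
  return { owner, repo: repo.replace(/\.git$/, '') };
}
```

With this sketch, the example URL above would resolve to owner facebook and repository react, while a URL on another host or without a repository segment would be rejected.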

The script will:

  1. Clone the specified repository
  2. Find all Markdown (.md) and MDX (.mdx) files
  3. Split content into chunks
  4. Generate embeddings (using either OpenAI or Upstash)
  5. Store the chunks in your Upstash Vector database
  6. Clean up temporary files
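
Step 2 above, the recursive file search, could be sketched with Node's built-in fs and path modules as follows. This is an illustration under assumptions rather than the code in script.js: the function name findMarkdownFiles is hypothetical, and skipping the .git directory is a choice made here for the example.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Hypothetical helper: returns paths of .md and .mdx files relative to rootDir.
function findMarkdownFiles(rootDir, currentDir = rootDir) {
  const results = [];
  for (const entry of fs.readdirSync(currentDir, { withFileTypes: true })) {
    const fullPath = path.join(currentDir, entry.name);
    if (entry.isDirectory()) {
      // Assumption: Git internals never contain documentation worth indexing.
      if (entry.name === '.git') continue;
      results.push(...findMarkdownFiles(rootDir, fullPath));
    } else if (/\.(md|mdx)$/i.test(entry.name)) {
      results.push(path.relative(rootDir, fullPath));
    }
  }
  // Sort for a deterministic processing order.
  return results.sort();
}
```

Returning relative paths means the same values can later be stored as the file-path metadata for each chunk.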

Configuration

Embedding Options

Supported Embedding Providers

  1. OpenAI Embeddings (default if API key is provided)

    • Requires OPENAI_API_KEY in .env
    • Uses OpenAI's text-embedding-ada-002 model
  2. Upstash Embeddings (used when OpenAI API key is not provided)

    • No additional configuration needed
    • Uses Upstash's built-in embedding service
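
The selection rule described above (OpenAI when a key is present, Upstash otherwise) can be expressed as a small function. This is a sketch of the rule only. The function name selectEmbeddingProvider and the returned object shape are hypothetical, and the code does not call either service.

```javascript
// Hypothetical helper: mirrors the provider rule described in this section.
function selectEmbeddingProvider(env = process.env) {
  if (env.OPENAI_API_KEY) {
    // Model name as stated in this README.
    return { provider: 'openai', model: 'text-embedding-ada-002' };
  }
  // Upstash's built-in embedding service needs no extra configuration here.
  return { provider: 'upstash', model: null };
}
```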

Customizing Document Chunking

To adjust how documents are split into chunks, you can update the configuration in script.js:

const textSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,    // Adjust chunk size as needed
  chunkOverlap: 200   // Adjust overlap as needed
});
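
To build intuition for how chunkSize and chunkOverlap interact, here is a deliberately simplified fixed-window splitter. It is not LangChain's algorithm: RecursiveCharacterTextSplitter prefers to break on separators such as paragraphs and sentences, so its real chunk boundaries will differ. The function name chunkText is hypothetical.

```javascript
// Simplified illustration only: a fixed-size sliding window over characters.
function chunkText(text, chunkSize = 1000, chunkOverlap = 200) {
  if (chunkOverlap >= chunkSize) {
    throw new Error('chunkOverlap must be smaller than chunkSize');
  }
  // Each new chunk starts this many characters after the previous one.
  const step = chunkSize - chunkOverlap;
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once the current window reaches the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}
```

With the default values the window advances 800 characters at a time, so a 2,600-character document would produce three chunks starting at offsets 0, 800, and 1600. A larger overlap preserves more context across boundaries at the cost of storing more vectors.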

Metadata

Metadata accompanies each stored chunk for improved context:

  • Original file name
  • File type (Markdown or MDX)
  • Relative file path in the repository
  • Document source for the specific chunk of text
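
As a rough picture of what such a metadata record might look like, the sketch below builds one object per chunk from a file's relative path. The field names (fileName, fileType, filePath, source) and the format of the source value are assumptions made for illustration. The actual keys written by script.js may differ.

```javascript
const path = require('node:path');

// Hypothetical helper: builds a metadata record for one chunk of one file.
function buildChunkMetadata(relativePath, chunkIndex) {
  const ext = path.extname(relativePath).toLowerCase();
  return {
    fileName: path.basename(relativePath),
    fileType: ext === '.mdx' ? 'mdx' : 'markdown',
    filePath: relativePath,
    // Assumed format: identifies which chunk of which file this text came from.
    source: `${relativePath}#chunk-${chunkIndex}`,
  };
}
```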

Error Handling

The script is designed to handle errors gracefully in the following cases:

  • Invalid repository URLs
  • Missing or incorrect credentials
  • Files that cannot be accessed or read
  • Network or connectivity problems

In case of errors, the script will:

  1. Log the error message
  2. Clean up any temporary files
  3. Exit with a non-zero status code
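
The log, clean up, exit sequence above maps naturally onto try/catch/finally. The following sketch shows that pattern in isolation. The names runWithCleanup, task, and cleanup are hypothetical, and the real script may structure its error handling differently.

```javascript
// Hypothetical wrapper: runs a task, always cleans up, and reports an exit code.
async function runWithCleanup(task, cleanup, log = console.error) {
  let exitCode = 0;
  try {
    await task();
  } catch (error) {
    log(`Error: ${error.message}`); // 1. log the error message
    exitCode = 1;                   // 3. non-zero status on failure
  } finally {
    await cleanup();                // 2. remove temporary files either way
  }
  return exitCode;
}

// Possible usage, with main and removeTempDir standing in for real functions:
// runWithCleanup(main, removeTempDir).then((code) => { process.exitCode = code; });
```

Setting process.exitCode instead of calling process.exit() lets pending output flush before the process ends.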

Contributing

Feel free to submit issues and enhancement requests!

License

MIT License - feel free to use this tool for any purpose.

Credits

This tool uses the following open-source packages:

  • LangChain: Handles document processing and vector store integration
  • Octokit: Facilitates interactions with the GitHub API
  • simple-git: Manages operations on Git repositories
  • Upstash Vector: Enables seamless storage and retrieval of document vectors