@robypag/langchain-splitter

v0.1.6

A small wrapper module that simplifies tokenizing files and buffers using LangChain

Tokenizer Utility

This is a small utility I built to support my other AI-related projects. It doesn't do much, and I did not want to create more than this: it does exactly what I need.

If it can help you, or you feel it is worth an upgrade, feel free to fork it. Pull requests are warmly welcome.

Why

While working with Retrieval Augmented Generation (RAG), you usually need to process a file in order to generate embeddings for it. The common technique is to split the file into chunks, then generate embeddings for each chunk.

This is a repetitive and tedious task, so instead of copying and pasting the same function over and over again, I decided to build a small library.

What

Tokenizer exposes two main functions:

  • tokenizeFile
  • tokenizeFromStringOrBuffer

They both do the same thing, but from a different starting point: as the names imply, tokenizeFile takes a file path, whereas tokenizeFromStringOrBuffer takes a string or a buffer.

Supported Files

It currently supports files that can contain text: pdf, doc and docx, plus text-based files such as txt and csv.

It applies a heuristic approach to determine which kind of file or buffer it has been given:

  • tokenizeFile first uses the mime-types module to determine the file type. If this fails (mainly because the provided file has a mismatching extension or does not have an extension at all), it uses the file-type module to look at the file content and determine its type.

  • tokenizeFromStringOrBuffer assumes that if the provided content is a string, the resulting file is text-based. If the provided content is a buffer, it uses file-type as above to inspect the buffer content and determine which kind of file it is, then generates a temporary file using the returned extension. Since file-type does not support text files, it returns undefined when the buffer contains only text, so the function generates a txt temporary file in that case. Once the temporary file exists, the function calls tokenizeFile with the temporary path.
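
The detection order described above (extension first, then content, then a txt fallback) can be sketched roughly as follows. This is a minimal illustration, not the library's code: the names extensionOf, sniffContent and detectKind are hypothetical, the small extension table stands in for the mime-types lookup, and the magic-byte comparison stands in for what file-type does far more thoroughly.

```typescript
// Hypothetical stand-in for the mime-types lookup: a tiny table of known extensions.
const KNOWN_EXTENSIONS = new Set(["pdf", "doc", "docx", "txt", "csv"]);

// Returns the lowercased extension if it is one we recognise, otherwise undefined.
function extensionOf(path: string): string | undefined {
  const name = path.split(/[\\/]/).pop() ?? "";
  const dot = name.lastIndexOf(".");
  if (dot <= 0) return undefined; // no extension at all, or a dotfile
  const ext = name.slice(dot + 1).toLowerCase();
  return KNOWN_EXTENSIONS.has(ext) ? ext : undefined;
}

// Hypothetical stand-in for file-type: compare the leading "magic" bytes.
function sniffContent(bytes: Uint8Array): string | undefined {
  const startsWith = (sig: number[]) => sig.every((b, i) => bytes[i] === b);
  if (startsWith([0x25, 0x50, 0x44, 0x46])) return "pdf"; // "%PDF"
  if (startsWith([0x50, 0x4b, 0x03, 0x04])) return "docx"; // ZIP container (assumed to be docx here)
  if (startsWith([0xd0, 0xcf, 0x11, 0xe0])) return "doc"; // OLE2 compound file
  return undefined; // plain text has no signature
}

// Extension first, then content sniffing, then fall back to txt.
function detectKind(path: string, bytes: Uint8Array): string {
  return extensionOf(path) ?? sniffContent(bytes) ?? "txt";
}
```

For example, detectKind("upload", bytes) returns "pdf" when bytes begins with "%PDF", and "txt" when no signature matches. A real ZIP signature does not prove a file is docx, which is one reason a dedicated module is the better choice in practice.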

Langchain Parameters

In all cases, this library uses LangChain's RecursiveCharacterTextSplitter class to process the given text. Its signature is described in the LangChain documentation.

This library currently uses only two of its parameters:

  • chunkSize: the maximum size of each text chunk, measured in characters by LangChain's default length function. Defaults to 1000
  • chunkOverlap: the maximum number of characters that two adjacent chunks can share. Defaults to 200
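
To make the interaction between the two parameters concrete, here is a simplified sketch of fixed-window splitting with overlap. It is not LangChain's algorithm: RecursiveCharacterTextSplitter splits on separators recursively, whereas this hypothetical splitWithOverlap function just slides a window over the string. Sizes are counted in string characters, which is an assumption of this sketch.

```typescript
// Simplified fixed-window splitter; NOT the recursive, separator-aware LangChain algorithm.
function splitWithOverlap(text: string, chunkSize = 1000, chunkOverlap = 200): string[] {
  if (chunkSize <= 0) {
    throw new RangeError("chunkSize must be positive");
  }
  if (chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new RangeError("chunkOverlap must be in [0, chunkSize)");
  }
  // Each new chunk starts this many characters after the previous one.
  const step = chunkSize - chunkOverlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once the current window has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

// With chunkSize 4 and chunkOverlap 1, neighbouring chunks share one character.
console.log(splitWithOverlap("abcdefghij", 4, 1)); // [ 'abcd', 'defg', 'ghij' ]
```

With the defaults of 1000 and 200, each chunk starts 800 characters after the previous one, so the last 200 characters of one chunk reappear at the start of the next.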

Who

Me, myself and I.

License

See LICENSE