@flexpilot-ai/tokenizers

v0.0.1

Node.js binding for huggingface/tokenizers library

Downloads

413

Main Features

  • Fast and Efficient: Leverages Rust's performance for rapid tokenization.
  • Versatile: Supports various tokenization models including BPE, WordPiece, and Unigram.
  • Easy Integration: Seamlessly use pre-trained tokenizers in your Node.js projects.
  • Customizable: Fine-tune tokenization parameters for your specific use case.
  • Production-Ready: Designed for both research and production environments.

Installation

Install the package using npm:

npm install @flexpilot-ai/tokenizers

Usage Example

Here's an example demonstrating how to use the Tokenizer class:

import { Tokenizer } from "@flexpilot-ai/tokenizers";
import fs from "fs";

// Read the tokenizer configuration file
const fileBuffer = fs.readFileSync("path/to/tokenizer.json");
const byteArray = Array.from(fileBuffer);

// Create a new Tokenizer instance
const tokenizer = new Tokenizer(byteArray);

// Encode a string
const text = "Hello, y'all! How are you 😁 ?";
const encoded = tokenizer.encode(text, true);
console.log("Encoded:", encoded);

// Decode the tokens
const decoded = tokenizer.decode(encoded, false);
console.log("Decoded:", decoded);

// Use the fast encoding method
const fastEncoded = tokenizer.encodeFast(text, true);
console.log("Fast Encoded:", fastEncoded);

API Reference

Tokenizer

The main class for handling tokenization.

Constructor

constructor(bytes: Array<number>)

Creates a new Tokenizer instance from a configuration provided as an array of bytes.

  • bytes: The raw bytes of the tokenizer configuration file (for example, tokenizer.json), given as an array of numbers.
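
Because the constructor takes a plain array of numbers rather than a Buffer, the bytes read from disk need converting first. The following is a minimal sketch of that step; `toByteArray` is a hypothetical helper name, not part of the package, and the empty-input check is an added assumption about what counts as a usable configuration.

```typescript
// Hypothetical helper (not part of @flexpilot-ai/tokenizers): converts raw bytes,
// such as the Buffer returned by fs.readFileSync, into the Array<number> form
// that the Tokenizer constructor is documented to accept.
function toByteArray(bytes: Uint8Array): number[] {
  // Assumption: an empty configuration is never valid, so fail early.
  if (bytes.length === 0) {
    throw new Error("tokenizer configuration is empty");
  }
  // Array.from copies each byte into a plain number between 0 and 255.
  return Array.from(bytes);
}
```

Since a Node.js Buffer is a Uint8Array, the result of `fs.readFileSync("path/to/tokenizer.json")` can be passed to this helper directly and the output handed to `new Tokenizer(...)`.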

Methods

encode

encode(input: string, addSpecialTokens: boolean): Array<number>

Encodes the input text into token IDs.

  • input: The text to tokenize.
  • addSpecialTokens: Whether to add special tokens during encoding.
  • Returns: An array of numbers representing the token IDs.
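
A common use of the returned ID array is checking whether a piece of text fits within a token budget. The sketch below assumes only the `encode` signature documented above; `EncoderLike` and `fitsTokenBudget` are hypothetical names introduced for illustration, and the interface exists so the helper can be tried with a stub as well as a real Tokenizer instance.

```typescript
// Structural type covering only the documented encode method.
interface EncoderLike {
  encode(input: string, addSpecialTokens: boolean): number[];
}

// Hypothetical helper: counts the tokens in `text` (special tokens included)
// and reports whether the count stays within `maxTokens`.
function fitsTokenBudget(
  tokenizer: EncoderLike,
  text: string,
  maxTokens: number
): { count: number; fits: boolean } {
  const ids = tokenizer.encode(text, true);
  return { count: ids.length, fits: ids.length <= maxTokens };
}
```

Passing `true` for addSpecialTokens here is a design choice: it counts the tokens a model would actually receive, which may be more than the text alone produces.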

decode

decode(ids: Array<number>, skipSpecialTokens: boolean): string

Decodes the token IDs back into text.

  • ids: An array of numbers representing the token IDs.
  • skipSpecialTokens: Whether to skip special tokens during decoding.
  • Returns: The decoded text as a string.
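
To see how encode and decode fit together, a round trip can be checked as sketched below. `CodecLike` and `roundTrip` are hypothetical names, not part of the package. Note that a real tokenizer may normalize text (for example, whitespace or casing), so a round trip is not guaranteed to reproduce the input exactly; the `lossless` flag reports what happened rather than assuming it.

```typescript
// Structural type covering the documented encode and decode methods.
interface CodecLike {
  encode(input: string, addSpecialTokens: boolean): number[];
  decode(ids: number[], skipSpecialTokens: boolean): string;
}

// Hypothetical helper: encodes without special tokens, decodes while skipping
// any special tokens, and reports whether the original text came back unchanged.
function roundTrip(
  tokenizer: CodecLike,
  text: string
): { ids: number[]; decoded: string; lossless: boolean } {
  const ids = tokenizer.encode(text, false);
  const decoded = tokenizer.decode(ids, true);
  return { ids, decoded, lossless: decoded === text };
}
```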

encodeFast

encodeFast(input: string, addSpecialTokens: boolean): Array<number>

A faster version of the encode method for tokenizing text.

  • input: The text to tokenize.
  • addSpecialTokens: Whether to add special tokens during encoding.
  • Returns: An array of numbers representing the token IDs.
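
This README does not state whether encodeFast always returns the same IDs as encode, so it may be worth verifying on your own inputs before switching. The sketch below is one way to do that; `FastEncoderLike` and `findEncodingMismatches` are hypothetical names, and the helper assumes only the two signatures documented above.

```typescript
// Structural type covering the documented encode and encodeFast methods.
interface FastEncoderLike {
  encode(input: string, addSpecialTokens: boolean): number[];
  encodeFast(input: string, addSpecialTokens: boolean): number[];
}

// Hypothetical helper: returns the sample texts for which encode and
// encodeFast produce different token IDs. An empty result means the two
// methods agreed on every sample.
function findEncodingMismatches(
  tokenizer: FastEncoderLike,
  samples: string[]
): string[] {
  return samples.filter((text) => {
    const slow = tokenizer.encode(text, true);
    const fast = tokenizer.encodeFast(text, true);
    return slow.length !== fast.length || slow.some((id, i) => id !== fast[i]);
  });
}
```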

Contributing

We welcome contributions! Please see our Contributing Guide for more details.

License

This project is licensed under the Apache-2.0 License. See the LICENSE file for details.

Acknowledgments

  • This library is based on the HuggingFace Tokenizers Rust implementation.
  • Special thanks to the Rust and Node.js communities for their invaluable resources and support.