@flexpilot-ai/tokenizers
v0.0.1
Node.js binding for huggingface/tokenizers library
Main Features
- Fast and Efficient: Leverages Rust's performance for rapid tokenization.
- Versatile: Supports various tokenization models including BPE, WordPiece, and Unigram.
- Easy Integration: Seamlessly use pre-trained tokenizers in your Node.js projects.
- Customizable: Fine-tune tokenization parameters for your specific use case.
- Production-Ready: Designed for both research and production environments.
Installation
Install the package using npm:
npm install @flexpilot-ai/tokenizers
Usage Example
Here's an example demonstrating how to use the Tokenizer class:
import { Tokenizer } from "@flexpilot-ai/tokenizers";
import fs from "fs";
// Read the tokenizer configuration file
const fileBuffer = fs.readFileSync("path/to/tokenizer.json");
const byteArray = Array.from(fileBuffer);
// Create a new Tokenizer instance
const tokenizer = new Tokenizer(byteArray);
// Encode a string
const text = "Hello, y'all! How are you 😁 ?";
const encoded = tokenizer.encode(text, true);
console.log("Encoded:", encoded);
// Decode the tokens
const decoded = tokenizer.decode(encoded, false);
console.log("Decoded:", decoded);
// Use the fast encoding method
const fastEncoded = tokenizer.encodeFast(text, true);
console.log("Fast Encoded:", fastEncoded);
API Reference
Tokenizer
The main class for handling tokenization.
Constructor
constructor(bytes: Array<number>)
Creates a new Tokenizer instance from a configuration provided as an array of bytes.
- bytes: An array of numbers representing the tokenizer configuration.
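Because the constructor takes a plain array of numbers rather than a Buffer, it can help to convert and sanity-check the bytes in one place. The sketch below is a minimal illustration, not part of this package: toTokenizerBytes is a hypothetical helper name, and it assumes the configuration is a Hugging Face tokenizer.json file, which is a JSON document.

```typescript
// Hypothetical helper (not exported by @flexpilot-ai/tokenizers).
// Converts raw file bytes into the Array<number> shape the constructor
// documents, after checking that the bytes decode to valid JSON.
export function toTokenizerBytes(raw: Uint8Array): number[] {
  const text = new TextDecoder().decode(raw);
  try {
    // Assumption: the configuration is a tokenizer.json file, so it must parse.
    JSON.parse(text);
  } catch (err) {
    throw new Error(
      `Tokenizer configuration is not valid JSON: ${(err as Error).message}`
    );
  }
  // Array.from copies each byte into a plain JavaScript number.
  return Array.from(raw);
}
```

A Node.js Buffer is a Uint8Array, so the result of fs.readFileSync("path/to/tokenizer.json") can be passed straight to this helper, and the returned array can then be given to new Tokenizer(...).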
Methods
encode
encode(input: string, addSpecialTokens: boolean): Array<number>
Encodes the input text into token IDs.
- input: The text to tokenize.
- addSpecialTokens: Whether to add special tokens during encoding.
- Returns: An array of numbers representing the token IDs.
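A common use of encode is counting tokens before sending text to a model with a fixed context size. The following sketch assumes only the encode signature documented above; the Encoder interface and the countTokens and fitsBudget helpers are hypothetical names introduced for illustration.

```typescript
// Structural type mirroring the documented encode signature, so the helpers
// accept a Tokenizer instance or any stand-in with the same method.
interface Encoder {
  encode(input: string, addSpecialTokens: boolean): Array<number>;
}

// Hypothetical helper: number of token IDs produced for a piece of text.
export function countTokens(
  encoder: Encoder,
  text: string,
  addSpecialTokens = true
): number {
  return encoder.encode(text, addSpecialTokens).length;
}

// Hypothetical helper: true when the text encodes to at most maxTokens IDs.
export function fitsBudget(
  encoder: Encoder,
  text: string,
  maxTokens: number
): boolean {
  return countTokens(encoder, text) <= maxTokens;
}
```

With a real instance this would be called as fitsBudget(tokenizer, prompt, 512). Special tokens are included in the count by default here, since they also occupy positions in the model input.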
decode
decode(ids: Array<number>, skipSpecialTokens: boolean): string
Decodes the token IDs back into text.
- ids: An array of numbers representing the token IDs.
- skipSpecialTokens: Whether to skip special tokens during decoding.
- Returns: The decoded text as a string.
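Combining encode and decode gives a simple way to cut text down to a token limit. This is a sketch under the signatures documented above; the Codec interface and truncateToTokens are hypothetical names, not part of the package. Note that with many tokenizers the decoded text is not guaranteed to match the original characters exactly, for example when the tokenizer normalizes case or whitespace.

```typescript
// Structural type mirroring the documented encode and decode signatures.
interface Codec {
  encode(input: string, addSpecialTokens: boolean): Array<number>;
  decode(ids: Array<number>, skipSpecialTokens: boolean): string;
}

// Hypothetical helper: keep at most maxTokens tokens of the input text.
export function truncateToTokens(
  codec: Codec,
  text: string,
  maxTokens: number
): string {
  if (!Number.isInteger(maxTokens) || maxTokens < 0) {
    throw new RangeError("maxTokens must be a non-negative integer");
  }
  // Encode without special tokens so slicing cannot cut off a trailing marker.
  const ids = codec.encode(text, false);
  if (ids.length <= maxTokens) {
    return text; // Already within the limit: return the input unchanged.
  }
  // Decode only the tokens that fit, skipping any special tokens.
  return codec.decode(ids.slice(0, maxTokens), true);
}
```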
encodeFast
encodeFast(input: string, addSpecialTokens: boolean): Array<number>
A faster version of the encode method for tokenizing text.
- input: The text to tokenize.
- addSpecialTokens: Whether to add special tokens during encoding.
- Returns: An array of numbers representing the token IDs.
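The description above does not state whether encodeFast always returns the same IDs as encode, or how large the speed difference is. One way to find out for a given tokenizer and input is to measure it. The sketch below relies only on the two documented signatures; DualEncoder, Comparison, and compareEncoders are hypothetical names, and the timing loop is a rough measurement, not a rigorous benchmark.

```typescript
// Structural type mirroring the documented encode and encodeFast signatures.
interface DualEncoder {
  encode(input: string, addSpecialTokens: boolean): Array<number>;
  encodeFast(input: string, addSpecialTokens: boolean): Array<number>;
}

interface Comparison {
  sameIds: boolean;     // whether both methods returned identical IDs
  encodeMs: number;     // total time for `iterations` encode calls
  encodeFastMs: number; // total time for `iterations` encodeFast calls
}

// Hypothetical helper: compare the output and rough timing of both methods.
export function compareEncoders(
  tokenizer: DualEncoder,
  text: string,
  iterations = 100
): Comparison {
  const slow = tokenizer.encode(text, true);
  const fast = tokenizer.encodeFast(text, true);
  const sameIds =
    slow.length === fast.length && slow.every((id, i) => id === fast[i]);

  // Run a function `iterations` times and return the elapsed milliseconds.
  const time = (fn: () => void): number => {
    const start = performance.now();
    for (let i = 0; i < iterations; i++) fn();
    return performance.now() - start;
  };

  return {
    sameIds,
    encodeMs: time(() => tokenizer.encode(text, true)),
    encodeFastMs: time(() => tokenizer.encodeFast(text, true)),
  };
}
```

Running this on text that is representative of the real workload gives a quick indication of whether switching to encodeFast is worthwhile and whether the two methods agree for that input.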
Contributing
We welcome contributions! Please see our Contributing Guide for more details.
License
This project is licensed under the Apache-2.0 License - see the LICENSE file for details.
Acknowledgments
- This library is based on the HuggingFace Tokenizers Rust implementation.
- Special thanks to the Rust and Node.js communities for their invaluable resources and support.