brainnet-tokenizer
v1.0.0
A simple and efficient tokenizer for natural language processing tasks.
Tokenizer
A simple and efficient tokenizer for natural language processing tasks. This tokenizer supports multiple languages and handles special characters effectively.
Features
- Tokenizes text into words and special characters.
- Encodes text into token IDs.
- Decodes token IDs back into text.
- Saves and loads vocabulary from a file.
- Supports multiple languages.
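To give a feel for the round trip these features describe, here is a simplified, self-contained sketch of word/special-character tokenization with an ID vocabulary. It is illustrative only and is not the package's actual implementation:

```javascript
// Sketch: split text into words and individual special characters,
// then map each token to a numeric ID (illustrative, not the real package).
function tokenize(text) {
  return text.match(/\w+|[^\w\s]/g) || [];
}

const vocab = new Map(); // token -> ID
const reverse = [];      // ID -> token

function encode(text) {
  return tokenize(text).map((token) => {
    if (!vocab.has(token)) {
      vocab.set(token, reverse.length);
      reverse.push(token);
    }
    return vocab.get(token);
  });
}

function decode(ids) {
  // Naive join: punctuation comes back space-separated in this sketch.
  return ids.map((id) => reverse[id]).join(" ");
}

const ids = encode("Tokenization matters.");
console.log(ids);         // [0, 1, 2]
console.log(decode(ids)); // "Tokenization matters ."
```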
Installation
To use the `Tokenizer` class, install the package from npm:

```sh
npm install @brainnet/tokenizer
```
Usage
Here is an example of how to use the `Tokenizer` class:

```js
const Tokenizer = require('@brainnet/tokenizer');

// Create a Tokenizer instance
const tokenizer = new Tokenizer();

// Tokenize the text
const text = "Tokenization is a fundamental step in natural language processing.";
const tokens = tokenizer.tokenize(text);
console.log("Tokens:", tokens);

// Encode the text
const encodedResult = tokenizer.encode(text);
console.log("Encoding Result:", encodedResult);

// Decode the text
const decodedResult = tokenizer.decode(encodedResult.encodedArray);
console.log("Decoding Result:", decodedResult);

// Save the vocabulary to a file
const vocabularyFilePath = 'vocabulary.json';
tokenizer.saveVocabulary(vocabularyFilePath);
console.log("Vocabulary saved to", vocabularyFilePath);

// Create a new Tokenizer instance and load the vocabulary
const newTokenizer = new Tokenizer();
newTokenizer.loadVocabulary(vocabularyFilePath);
console.log("Vocabulary loaded from", vocabularyFilePath);

// Encode and decode the text using the loaded vocabulary
const newEncodedResult = newTokenizer.encode(text);
const newDecodedResult = newTokenizer.decode(newEncodedResult.encodedArray);
console.log("New Encoding Result:", newEncodedResult);
console.log("New Decoding Result:", newDecodedResult);

// Convert a token ID back to its token
const tokenId = tokenizer.getTokenId("Tokenization");
const token = tokenizer.getToken(tokenId);
console.log(`Token ID ${tokenId} corresponds to token: "${token}"`);
```
API
Tokenizer
constructor()
Creates an instance of Tokenizer.
tokenize(text: string): string[]
Tokenizes the input text into words and special characters.
getTokenId(token: string): number
Adds a token to the vocabulary if it doesn't exist, and returns its ID.
getToken(tokenId: number): string | null
Converts a token ID back to its corresponding token, or returns null if the ID is not in the vocabulary.
getVocabularySize(): number
Returns the size of the vocabulary.
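The `getTokenId`/`getToken`/`getVocabularySize` trio behaves like a bidirectional, grow-on-demand vocabulary lookup. A minimal sketch of that idea (not the package's internals):

```javascript
// Sketch of a grow-on-demand vocabulary (illustrative only).
const tokenToId = new Map();
const idToToken = [];

function getTokenId(token) {
  // Add the token on first sight; otherwise return its existing ID.
  if (!tokenToId.has(token)) {
    tokenToId.set(token, idToToken.length);
    idToToken.push(token);
  }
  return tokenToId.get(token);
}

function getToken(tokenId) {
  // Unknown IDs map to null, matching the string | null signature.
  return idToToken[tokenId] ?? null;
}

function getVocabularySize() {
  return idToToken.length;
}

const id = getTokenId("Tokenization");
console.log(getToken(id));                      // "Tokenization"
console.log(getTokenId("Tokenization") === id); // true: same token, same ID
console.log(getToken(999));                     // null
console.log(getVocabularySize());               // 1
```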
encode(text: string): Object
Encodes the input text; the returned object's `encodedArray` property holds the array of token IDs.
decode(encodedArray: number[]): Object
Decodes an array of token IDs back into text.
saveVocabulary(filePath: string): void
Saves the vocabulary to a file.
loadVocabulary(filePath: string): void
Loads the vocabulary from a file.
License
This project is licensed under the Apache-2.0 License.