tokensize
v1.0.0
NPM Module Documentation
The `tokenizer` function takes a string as input and returns an object with the following properties:

- `count`: the number of tokens in the input string
- `characters`: the number of characters in the input string
- `text`: the original input string
- `tokens`: an array of objects, where each object represents a token and its position in the input string. Each token object has the following properties:
  - `token`: the token string
  - `start`: the starting index of the token in the input string
  - `end`: the ending index of the token in the input string

The `tokenizer` function uses the `js-tiktoken` library to encode the input string into tokens using the GPT-2 encoding scheme. It then decodes the tokens back into strings, maps the tokens to their positions in the input string using the `mapTokensToChunks` function, and returns the resulting object.
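
The position-mapping step can be sketched roughly as follows. This is not the module's actual `mapTokensToChunks` code: `locateTokens` is a hypothetical name, and the sketch assumes that the decoded token strings concatenate back to the input, that `start` skips a token's leading whitespace, and that `end` is an inclusive index.

```javascript
// Hypothetical sketch of mapping decoded tokens to positions in the input.
// Assumptions (not confirmed by the module): `pieces` are the decoded token
// strings in order, they concatenate back to `text`, `start` skips leading
// whitespace, and `end` is inclusive.
function locateTokens(text, pieces) {
  const tokens = [];
  let cursor = 0; // index in `text` where the next piece begins
  for (const piece of pieces) {
    // Count leading whitespace, unless the piece is whitespace only.
    const leading =
      piece.trim() === '' ? 0 : piece.length - piece.trimStart().length;
    tokens.push({
      token: piece,
      start: cursor + leading,
      end: cursor + piece.length - 1,
    });
    cursor += piece.length;
  }
  return tokens;
}

console.log(locateTokens('This is a', ['This', ' is', ' a']));
// [
//   { token: 'This', start: 0, end: 3 },
//   { token: ' is', start: 5, end: 6 },
//   { token: ' a', start: 8, end: 8 }
// ]
```

Walking a cursor through the text avoids repeated `indexOf` searches, which could match an earlier occurrence of a repeated token.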
Usage
To use this module, import the `tokenizer` function and call it with a string argument. Here's an example:
```js
import { tokenizer } from 'tokensize';

const input = 'This is a sample input string.';
const result = await tokenizer(input);
console.log(result);
/*
{
  count: 7,
  characters: 30,
  text: 'This is a sample input string.',
  tokens: [
    { token: 'This', start: 0, end: 3 },
    { token: 'Ġis', start: 5, end: 6 },
    { token: 'Ġa', start: 8, end: 8 },
    { token: 'Ġsample', start: 10, end: 15 },
    { token: 'Ġinput', start: 17, end: 21 },
    { token: 'Ġstring', start: 23, end: 28 },
    { token: '.', start: 29, end: 29 }
  ]
}
*/
```
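
One possible use of the returned positions is trimming text to a token budget. The sketch below is only an illustration: `truncateToTokens` is a hypothetical helper that is not part of this module, it assumes `end` is an inclusive index, and it runs on a hand-written object in the documented shape rather than on real tokenizer output.

```javascript
// Hypothetical helper, not part of the module: keep only the text covered
// by the first `maxTokens` tokens. Assumes `end` is an inclusive index.
function truncateToTokens(result, maxTokens) {
  if (maxTokens <= 0) return '';
  if (result.tokens.length <= maxTokens) return result.text;
  const last = result.tokens[maxTokens - 1];
  return result.text.slice(0, last.end + 1);
}

// Hand-written object in the documented shape, so the sketch runs
// without calling the tokenizer.
const sample = {
  count: 3,
  characters: 9,
  text: 'This is a',
  tokens: [
    { token: 'This', start: 0, end: 3 },
    { token: ' is', start: 5, end: 6 },
    { token: ' a', start: 8, end: 8 },
  ],
};

console.log(truncateToTokens(sample, 2)); // "This is"
```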