@bluryar/tokenizers-js

v0.1.9

A JavaScript wrapper around [HuggingFace Tokenizers](https://github.com/huggingface/tokenizers) that uses WebAssembly for high-performance text tokenization.

Downloads

517

Readme

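Features

  • High-performance text tokenization implemented with Rust and WebAssembly
  • Supports browser environments
  • Ships complete TypeScript type definitions
  • Supports batch encoding and decoding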

Installation

npm install @bluryar/tokenizers-js
# or
pnpm add @bluryar/tokenizers-js
# or
yarn add @bluryar/tokenizers-js

Usage

import { init, TokenizerWrapper } from '@bluryar/tokenizers-js'

// Initialize the WASM module (required before creating a tokenizer)
await init()

// Create a tokenizer instance from the contents of a tokenizer.json file
const config = await fetch('path/to/tokenizer.json').then(r => r.text())
const tokenizer = new TokenizerWrapper(config)

// Encode text
const encoding = await tokenizer.encode('Hello, world!', true)
console.log(encoding.tokens)    // the tokens
console.log(encoding.ids)       // the token IDs

// Decode
const text = await tokenizer.decode(encoding.ids, true)
console.log(text)              // e.g. 'Hello, world!' (exact output depends on the tokenizer)
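
Because `init()` and the `tokenizer.json` download only need to happen once, an application may want to share a single tokenizer between callers. The following is a minimal sketch of that pattern, not something the package provides: `lazyAsync` is a hypothetical helper name, and the usage shown in the comments assumes only the `init` and `TokenizerWrapper` calls from the example above.

```typescript
// Hypothetical helper (not part of @bluryar/tokenizers-js): run an async
// factory at most once and hand the same promise to every caller.
function lazyAsync<T>(factory: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined
  return () => {
    if (cached !== undefined) return cached
    const pending = factory().catch((err: unknown) => {
      cached = undefined // forget the failure so a later call can retry
      throw err
    })
    cached = pending
    return pending
  }
}

// Usage sketch with the calls documented above (left as comments so the
// block stays runnable without the package installed):
//
// const getTokenizer = lazyAsync(async () => {
//   await init()
//   const config = await fetch('path/to/tokenizer.json').then(r => r.text())
//   return new TokenizerWrapper(config)
// })
// const tokenizer = await getTokenizer()
```

Caching the promise rather than the resolved value means concurrent callers that arrive during loading still trigger only one load.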

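API Documentation

TokenizerWrapper

The main tokenizer wrapper class. It provides the following methods:

encode(text: string, addSpecialTokens: boolean): Promise<EncodingWrapper>

Encodes text into tokens.

  • text: the input text
  • addSpecialTokens: whether to add special tokens
  • Returns: an EncodingWrapper object containing the encoding result

encodeBatch(texts: string[], addSpecialTokens: boolean): Promise<EncodingWrapper[]>

Encodes multiple texts in one batch.

  • texts: array of input texts
  • addSpecialTokens: whether to add special tokens
  • Returns: an array of EncodingWrapper objects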
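
Encodings in a batch usually differ in length, while model inputs often need rectangular arrays. The sketch below right-pads each sequence to the longest one in the batch. It is an illustration rather than part of the package: `padBatch`, `PaddableEncoding`, and `PaddedBatch` are hypothetical names, the `padId` default of 0 is an assumption (use the pad token ID from your own tokenizer.json), and none of this is needed if your tokenizer.json already configures padding.

```typescript
// Minimal shape this sketch needs from each encoding. The field names follow
// the EncodingWrapper properties documented in this README.
interface PaddableEncoding {
  ids: Uint32Array
  attentionMask: Uint32Array
}

interface PaddedBatch {
  inputIds: Uint32Array[]
  attentionMask: Uint32Array[]
  length: number
}

// Hypothetical helper: right-pad every sequence to the longest one.
// `padId` is an assumption; look up the real pad token ID for your model.
function padBatch(encodings: PaddableEncoding[], padId = 0): PaddedBatch {
  const length = encodings.reduce((max, e) => Math.max(max, e.ids.length), 0)
  const inputIds = encodings.map((e) => {
    const row = new Uint32Array(length).fill(padId)
    row.set(e.ids) // real token IDs go at the front, padding stays behind
    return row
  })
  const attentionMask = encodings.map((e) => {
    const row = new Uint32Array(length) // zero-filled, so padding is masked out
    row.set(e.attentionMask)
    return row
  })
  return { inputIds, attentionMask, length }
}
```

The result of `encodeBatch` could be passed straight to `padBatch`, assuming each EncodingWrapper exposes `ids` and `attentionMask` as described below.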

decode(ids: number[], skipSpecialTokens: boolean): Promise<string>

Decodes token IDs into text.

  • ids: array of token IDs
  • skipSpecialTokens: whether to skip special tokens
  • Returns: the decoded text
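
The `decode` signature declares `number[]`, while the `ids` property of EncodingWrapper is documented as a `Uint32Array`. Whether the binding accepts a typed array directly is not stated here, so the sketch below converts explicitly. It also illustrates roughly what `skipSpecialTokens` means by dropping positions flagged in `specialTokensMask`, assuming the mask uses 1 for special tokens as in HuggingFace Tokenizers. `stripSpecialTokens` is a hypothetical helper, not a package export.

```typescript
// Hypothetical helper: return plain number[] token IDs with the positions
// marked as special (mask value 1) removed. Assumes mask value 0 means
// "ordinary token", following the HuggingFace Tokenizers convention.
function stripSpecialTokens(
  ids: ArrayLike<number>,
  specialTokensMask: ArrayLike<number>,
): number[] {
  if (ids.length !== specialTokensMask.length) {
    throw new RangeError('ids and specialTokensMask must have the same length')
  }
  // Array.from turns a Uint32Array into a regular array of numbers.
  return Array.from(ids).filter((_, i) => specialTokensMask[i] === 0)
}
```

For a plain conversion without filtering, `Array.from(encoding.ids)` on its own is enough.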

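decodeBatch(sentences: number[][], skipSpecialTokens: boolean): Promise<string[]>

Decodes multiple sequences of token IDs in one batch.

  • sentences: two-dimensional array of token IDs
  • skipSpecialTokens: whether to skip special tokens
  • Returns: an array of decoded texts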

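EncodingWrapper

A wrapper class for encoding results. It exposes the following properties:

  • tokens: string[] - the tokens
  • ids: Uint32Array - the token IDs
  • typeIds: Uint32Array - the token type IDs
  • wordIds: (number | null)[] - the word IDs
  • offsets: [number, number][] - the position offsets of each token in the original text
  • specialTokensMask: Uint32Array - the special tokens mask
  • attentionMask: Uint32Array - the attention mask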
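
As an illustration of how `offsets` relates tokens back to the input, the sketch below slices the original text with each `[start, end]` pair. It assumes the offsets index into the JavaScript string as UTF-16 code units, which holds for ASCII input but is not confirmed by this README for other text. `tokenSpans` and `OffsetEncoding` are hypothetical names.

```typescript
// Minimal shape this sketch needs; field names follow the list above.
interface OffsetEncoding {
  tokens: string[]
  offsets: [number, number][]
}

interface TokenSpan {
  token: string
  start: number
  end: number
  surface: string // the slice of the original text covered by the token
}

// Hypothetical helper: pair each token with the text it was produced from.
// Assumption: offsets are half-open [start, end) indices into `text`.
function tokenSpans(text: string, encoding: OffsetEncoding): TokenSpan[] {
  return encoding.offsets.map(([start, end], i) => ({
    token: encoding.tokens[i] ?? '',
    start,
    end,
    surface: text.slice(start, end),
  }))
}
```

Comparing `token` with `surface` can reveal normalization steps such as lowercasing, since `surface` always comes from the untouched input.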

Browser Compatibility

  • Requires a modern browser with WebAssembly support
  • Recommended: Chrome 57+, Firefox 52+, Safari 11+, Edge 16+
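
Instead of checking browser versions, an application can test for WebAssembly at runtime before calling `init()`. The sketch below is one way to do that with the standard `WebAssembly.validate` function; `supportsWebAssembly` is a hypothetical helper, not something the package exports.

```typescript
// Smallest valid module: the "\0asm" magic number followed by version 1.
const EMPTY_WASM_MODULE = new Uint8Array([0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00])

// Hypothetical helper: report whether the current runtime can validate a
// trivial WebAssembly module. The cast avoids depending on DOM or Node typings.
function supportsWebAssembly(): boolean {
  const wasm = (globalThis as unknown as {
    WebAssembly?: { validate?: (bytes: Uint8Array) => boolean }
  }).WebAssembly
  if (typeof wasm !== 'object' || typeof wasm.validate !== 'function') {
    return false
  }
  try {
    return wasm.validate(EMPTY_WASM_MODULE)
  } catch {
    return false // treat any unexpected failure as "not supported"
  }
}
```

When the check fails, the application can show a fallback message rather than letting initialization fail.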

Development Guide

  1. Clone the repository:
git clone https://github.com/bluryar/tokenizers-js.git
cd tokenizers-js
  2. Install dependencies:
# Build the wasm package
wasm-pack build --target web --out-dir packages/tokenizers-wasm
pnpm install
  3. Build:
pnpm build