charabia-js
v0.2.0
charabia-js is a WebAssembly binding for the charabia multilingual text tokenizer used by Meilisearch.
Supported scripts / languages
- Latin
- Latin - German
- Greek
- Cyrillic - Georgian
- Chinese CMN 🇨🇳
- Hebrew 🇮🇱
- Arabic
- Japanese 🇯🇵
- Korean 🇰🇷
- Thai 🇹🇭
- Khmer 🇰🇭
More information about the supported scripts and languages can be found in the charabia repository.
Installation
npm install charabia-js
Usage
Segmentation
import { segment } from "charabia-js";
console.log(segment("Hello, world!")); // [ 'Hello', ', ', 'world', '!' ]
console.log(segment("你好,世界!")); // [ '你好', ',', '世界', '!' ]
console.log(segment("Hello, 世界!")); // [ 'Hello', ', ', '世界', '!' ]
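If you need to map segments back to positions in the input, a small helper can walk the returned array and accumulate lengths. The sketch below is not part of charabia-js: `withOffsets` is a hypothetical helper, the stub array simply copies the first output shown above instead of calling `segment`, and it assumes the segments concatenate back to the original string (which holds for the examples above). Offsets are in UTF-16 code units, as with `String.prototype.length`.

```javascript
// Hypothetical helper (not part of charabia-js): attach start/end offsets
// to each segment, assuming the segments concatenate back to the input.
function withOffsets(segments) {
  let position = 0;
  return segments.map((text) => {
    const start = position;
    position += text.length; // UTF-16 code units
    return { text, start, end: position };
  });
}

// Stand-in for segment("Hello, world!"), copied from the output shown above.
const stubSegments = ["Hello", ", ", "world", "!"];

console.log(withOffsets(stubSegments));
```

In real use, pass the array returned by `segment(...)` in place of `stubSegments`.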
Tokenization
import { tokenize, TokenKind } from "charabia-js";
import assert from "node:assert";
const tokens = tokenize(
"The quick (\"brown\") fox can't jump 32.3 feet, right? Brr, it's 29.3°F"
);
// Each token exposes a normalized lemma and a kind.
let token = tokens[0];
assert.equal(token.lemma, "the"); // "The" is lowercased
assert.equal(token.kind, TokenKind.Word);
// Whitespace between words is reported as a soft separator.
token = tokens[1];
assert.equal(token.lemma, " ");
assert.equal(token.kind, TokenKind.SoftSeparator);
token = tokens[2];
assert.equal(token.lemma, "quick");
assert.equal(token.kind, TokenKind.Word);
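A common follow-up is to keep only the word tokens and discard separators. The sketch below shows one way to do that, assuming tokens are objects with the `lemma` and `kind` fields used above. `wordLemmas` is a hypothetical helper, and `StubKind` and `stubTokens` are stand-ins (with made-up enum values) so the snippet runs without the package.

```javascript
// Hypothetical helper (not part of charabia-js): collect the lemmas of all
// tokens whose kind equals `wordKind`, in their original order.
function wordLemmas(tokens, wordKind) {
  return tokens.filter((t) => t.kind === wordKind).map((t) => t.lemma);
}

// Stand-ins for TokenKind and for the first three tokens asserted above.
// The numeric values are assumptions, not the library's actual enum values.
const StubKind = { Word: 0, SoftSeparator: 1 };
const stubTokens = [
  { lemma: "the", kind: StubKind.Word },
  { lemma: " ", kind: StubKind.SoftSeparator },
  { lemma: "quick", kind: StubKind.Word },
];

console.log(wordLemmas(stubTokens, StubKind.Word)); // [ 'the', 'quick' ]
```

In real use, this would be called as `wordLemmas(tokenize(text), TokenKind.Word)` with both names imported from charabia-js.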
License
This project is licensed under the MIT License - see the LICENSE file for details.