@nahanil/zh-tokenizer
Tokenizes Chinese texts into words using CC-CEDICT.
Extended from https://github.com/takumif/cedict-lookup.
Installation
Use npm to install:
npm install @nahanil/zh-tokenizer --save
Usage
Make sure to provide the CC-CEDICT data:
const tokenizer = require('@nahanil/zh-tokenizer')('./cedict.txt')
console.log(tokenizer.tokenize('我是中国人。'))
To tokenize traditional-character text, pass 'traditional' as the second argument (this mode will not work with simplified characters):
const tokenizer = require('@nahanil/zh-tokenizer')('./cedict.txt', 'traditional')
console.log(tokenizer.tokenize('我是中國人。'))
Output:
[ { traditional: '我',
    simplified: '我',
    pinyin: 'wo3',
    pinyinPretty: 'wǒ',
    english: 'I/me/my' },
  { traditional: '是',
    simplified: '是',
    pinyin: 'shi4',
    pinyinPretty: 'shì',
    english: 'is/are/am/yes/to be\nvariant of 是[shi4]/(used in given names)' },
  { traditional: '中國人',
    simplified: '中国人',
    pinyin: 'zhong1 guo2 ren2',
    pinyinPretty: 'zhōng guó rén',
    english: 'Chinese person' },
  { traditional: '。',
    simplified: '。',
    pinyin: null,
    pinyinPretty: null,
    english: null } ]
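Each token carries the CC-CEDICT fields shown above, so the array is easy to post-process. A minimal sketch of building a pinyin gloss from the result, using a hard-coded token array shaped like the output above so the snippet runs without the package or the dictionary file:

```javascript
// Tokens shaped like the tokenizer's output (hard-coded here as an
// assumption so the example is self-contained).
const tokens = [
  { traditional: '我', simplified: '我', pinyin: 'wo3',
    pinyinPretty: 'wǒ', english: 'I/me/my' },
  { traditional: '中國人', simplified: '中国人', pinyin: 'zhong1 guo2 ren2',
    pinyinPretty: 'zhōng guó rén', english: 'Chinese person' },
  { traditional: '。', simplified: '。', pinyin: null,
    pinyinPretty: null, english: null }
]

// Punctuation tokens have null pinyin fields, so filter them out
// before joining the pretty pinyin into a gloss line.
const gloss = tokens
  .filter(t => t.pinyinPretty !== null)
  .map(t => t.pinyinPretty)
  .join(' ')

console.log(gloss) // 'wǒ zhōng guó rén'
```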