
taibun

v1.1.3

Taiwanese Hokkien Transliterator and Tokeniser

Downloads

46

Taibun.js

Taibun provides methods to customise transliteration and to retrieve information about Taiwanese Hokkien pronunciation. It also includes a word tokeniser for Taiwanese Hokkien.

Versions

Python Version

Install

Taibun can be installed from npm:

$ npm install taibun --save

Usage

Converter

The Converter class transliterates Chinese characters into the chosen transliteration system, using the parameters specified by the developer. It works for both Traditional and Simplified characters.

// Constructor
c = new Converter({ system, dialect, format, delimiter, sandhi, punctuation, convertNonCjk });

// Transliterate Chinese characters
c.get(input);

System

system String - system of transliteration.

| text | Tailo   | POJ     | Zhuyin      | TLPA      | Pingyim | Tongiong | IPA         |
| ---- | ------- | ------- | ----------- | --------- | ------- | -------- | ----------- |
| 台灣 | Tâi-uân | Tâi-oân | ㄉㄞˊ ㄨㄢˊ | Tai5 uan5 | Dáiwán  | Tāi-uǎn  | Tai²⁵ uan²⁵ |

Dialect

dialect String - preferred pronunciation.

| text           | south                       | north                       | singapore                  |
| -------------- | --------------------------- | --------------------------- | -------------------------- |
| 五月節我啉咖啡 | Gōo-gue̍h-tseh guá lim ka-pi | Gōo-ge̍h-tsueh guá lim ka-pi | Gōo-ge̍h-tsueh uá lim ko-pi |

Format

format String - format in which tones will be represented in the converted sentence.

  • mark (default) - uses diacritics on each syllable. Not available for TLPA
  • number - adds a number representing the tone at the end of each syllable
  • strip - removes all tone marking

| text | mark    | number    | strip   |
| ---- | ------- | --------- | ------- |
| 台灣 | Tâi-uân | Tai5-uan5 | Tai-uan |
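The relationship between the three formats can be illustrated without the library. The sketch below is not how Taibun implements `format`: it assumes Tailo-style output, the helper names `stripToneMarks` and `stripToneNumbers` are hypothetical, and the set of combining marks treated as tone diacritics is an assumption.

```javascript
// Hypothetical helpers, not part of the taibun API.
// Combining marks assumed to carry tone in Tailo-style text:
// grave, acute, circumflex, macron, double acute, caron, vertical line above.
const TONE_MARKS = /[\u0300\u0301\u0302\u0304\u030B\u030C\u030D]/g;

// 'mark' -> 'strip': decompose, drop the tone diacritics, recompose.
function stripToneMarks(text) {
  return text.normalize('NFD').replace(TONE_MARKS, '').normalize('NFC');
}

// 'number' -> 'strip': drop a tone digit that directly follows a letter.
function stripToneNumbers(text) {
  return text.replace(/(?<=\p{L})[1-9]/gu, '');
}

console.log(stripToneMarks('Tâi-uân'));     // Tai-uan
console.log(stripToneNumbers('Tai5-uan5')); // Tai-uan
```

Only the listed marks are removed, so other combining characters (for example the POJ dot in o͘) are left alone.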

Delimiter

delimiter String - sets the delimiter character that will be placed in between syllables of a word.

Default value depends on the chosen system:

  • '-' - for Tailo, POJ, Tongiong
  • '' - for Pingyim
  • ' ' - for Zhuyin, TLPA, IPA

| text | '-'     | ''     | ' '     |
| ---- | ------- | ------ | ------- |
| 台灣 | Tâi-uân | Tâiuân | Tâi uân |
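The default-delimiter rule can be written down as a small lookup. This is a minimal sketch under the assumption that system names are spelled exactly as in this README; `DEFAULT_DELIMITER` and `joinSyllables` are hypothetical names and not part of Taibun.

```javascript
// Hypothetical helper mirroring the documented defaults; not taibun code.
const DEFAULT_DELIMITER = new Map([
  ['Tailo', '-'], ['POJ', '-'], ['Tongiong', '-'],
  ['Pingyim', ''],
  ['Zhuyin', ' '], ['TLPA', ' '], ['IPA', ' '],
]);

function joinSyllables(syllables, system = 'Tailo', delimiter) {
  if (!DEFAULT_DELIMITER.has(system)) {
    throw new RangeError(`Unknown system: ${system}`);
  }
  // An explicit delimiter (even the empty string) wins over the default.
  const separator = delimiter ?? DEFAULT_DELIMITER.get(system);
  return syllables.join(separator);
}

console.log(joinSyllables(['Tâi', 'uân']));               // Tâi-uân
console.log(joinSyllables(['Tâi', 'uân'], 'Tailo', ''));  // Tâiuân
console.log(joinSyllables(['Tâi', 'uân'], 'Tailo', ' ')); // Tâi uân
```

Using `??` rather than `||` matters here, because `''` is a valid delimiter and must not fall back to the default.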

Sandhi

sandhi String - applies the sandhi rules of Taiwanese Hokkien.

Since it's difficult to encode all sandhi rules, Taibun provides multiple modes for sandhi conversion to allow for customised sandhi handling.

  • none - doesn't perform any tone sandhi
  • auto - closest approximation to fully correct Taiwanese tone sandhi, with proper handling of pronouns, suffixes, and words containing 仔
  • excLast - changes tone for every syllable except for the last one
  • inclLast - changes tone for every syllable including the last one

Default value depends on the chosen system:

  • auto - for Tongiong
  • none - for Tailo, POJ, Zhuyin, TLPA, Pingyim, IPA

| text             | none                    | auto                   | excLast                | inclLast               |
| ---------------- | ----------------------- | ---------------------- | ---------------------- | ---------------------- |
| 這是你的茶桌仔無 | Tse sī lí ê tê-toh-á bô | Tse sì li ē tē-to-á bô | Tsē sì li ē tē-tó-a bô | Tsē sì li ē tē-tó-a bō |

Sandhi rules also change depending on the dialect chosen.

| text | no sandhi | south   | north / singapore |
| ---- | --------- | ------- | ----------------- |
| 台灣 | Tâi-uân   | Tāi-uân | Tài-uân           |
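The simpler sandhi modes differ only in which syllables are affected, which can be modelled as a boolean mask. The sketch below is an illustration under stated assumptions, not Taibun's implementation: `defaultSandhi` and `sandhiMask` are hypothetical names, the actual tone changes are not modelled, and `auto` is left out because it depends on word-level rules.

```javascript
// Hypothetical helpers; they only model WHICH syllables are affected.
function defaultSandhi(system = 'Tailo') {
  // Documented defaults: auto for Tongiong, none for every other system.
  return system === 'Tongiong' ? 'auto' : 'none';
}

// Returns one boolean per syllable: true means "apply the tone change".
function sandhiMask(syllableCount, mode) {
  switch (mode) {
    case 'none':
      return Array(syllableCount).fill(false);
    case 'inclLast':
      return Array(syllableCount).fill(true);
    case 'excLast':
      return Array.from({ length: syllableCount }, (_, i) => i < syllableCount - 1);
    case 'auto':
      throw new Error("'auto' depends on word-level rules and is not modelled here");
    default:
      throw new RangeError(`Unknown sandhi mode: ${mode}`);
  }
}

console.log(defaultSandhi('Tongiong')); // auto
console.log(sandhiMask(3, 'excLast'));  // [ true, true, false ]
```

A `true` entry does not guarantee a visibly different tone mark, since some tones may map to a form written the same way.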

Punctuation

punctuation String - sets how punctuation and sentence-initial capitalisation are handled.

  • format (default) - converts Chinese-style punctuation to Latin-style punctuation and capitalises words at the beginning of each sentence
  • none - preserves Chinese-style punctuation and doesn't capitalise words at the beginning of new sentences

| text | format | none |
| ---- | ------ | ---- |
| 這是臺南,簡稱「南」(白話字:Tâi-lâm;注音符號:ㄊㄞˊ ㄋㄢˊ,國語:Táinán)。 | Tse sī Tâi-lâm, kán-tshing "lâm" (Pe̍h-uē-jī: Tâi-lâm; tsù-im hû-hō: ㄊㄞˊ ㄋㄢˊ, kok-gí: Táinán). | tse sī Tâi-lâm,kán-tshing「lâm」(Pe̍h-uē-jī:Tâi-lâm;tsù-im hû-hō:ㄊㄞˊ ㄋㄢˊ,kok-gí:Táinán)。 |

Convert non-CJK

convertNonCjk Boolean - defines whether or not to convert non-Chinese words. Can be used to convert Tailo to another romanisation system.

  • true - convert non-Chinese character words
  • false (default) - convert only Chinese character words

| text      | false                   | true                    |
| --------- | ----------------------- | ----------------------- |
| 我食pháng | ㆣㄨㄚˋ ㄐㄧㄚㆷ˙ pháng | ㆣㄨㄚˋ ㄐㄧㄚㆷ˙ ㄆㄤˋ |
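Because the constructor takes several string options, it can help to check them before creating a Converter. The following is a sketch only: `checkConverterOptions` is a hypothetical helper, the allowed values are copied from this README, and their exact spelling and casing are assumptions rather than a confirmed part of the library's contract.

```javascript
// Allowed values as listed in this README; spelling and casing are assumed.
const ALLOWED = {
  system: ['Tailo', 'POJ', 'Zhuyin', 'TLPA', 'Pingyim', 'Tongiong', 'IPA'],
  dialect: ['south', 'north', 'singapore'],
  format: ['mark', 'number', 'strip'],
  sandhi: ['none', 'auto', 'excLast', 'inclLast'],
  punctuation: ['format', 'none'],
};

// Hypothetical helper: returns a list of problems, empty when nothing is wrong.
function checkConverterOptions(options = {}) {
  const problems = [];
  for (const [key, value] of Object.entries(options)) {
    if (key === 'delimiter') {
      if (typeof value !== 'string') problems.push('delimiter must be a string');
    } else if (key === 'convertNonCjk') {
      if (typeof value !== 'boolean') problems.push('convertNonCjk must be a boolean');
    } else if (!Object.prototype.hasOwnProperty.call(ALLOWED, key)) {
      problems.push(`unknown option: ${key}`);
    } else if (!ALLOWED[key].includes(value)) {
      problems.push(`${key} must be one of: ${ALLOWED[key].join(', ')}`);
    }
  }
  // The README states that the mark format is not available for TLPA.
  if (options.system === 'TLPA' && options.format === 'mark') {
    problems.push('format "mark" is not available for TLPA');
  }
  return problems;
}

console.log(checkConverterOptions({ system: 'Zhuyin', convertNonCjk: true }));
console.log(checkConverterOptions({ system: 'TLPA', format: 'mark' }));
```

The first call prints an empty list and the second reports the TLPA restriction. The options object could then be passed to `new Converter(...)` unchanged.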

Tokeniser

The Tokeniser class performs NLTK wordpunct_tokenize-like tokenisation of a Taiwanese Hokkien sentence.

// Constructor
t = new Tokeniser(keepOriginal);

// Tokenise Taiwanese Hokkien sentence
t.tokenise(input);
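For readers unfamiliar with NLTK, wordpunct_tokenize splits text with the pattern `\w+|[^\w\s]+`, that is, runs of word characters or runs of punctuation. The sketch below approximates that pattern in JavaScript; `wordpunctTokenise` is a hypothetical name, and treating combining marks as word characters is an assumption made here so that syllables with tone diacritics stay whole. It is not Taibun's tokeniser: the library output shown in the Example section also splits runs of Chinese characters into words (太空, 朋友), which a regular expression alone cannot do.

```javascript
// Approximation of \w+|[^\w\s]+ with Unicode-aware character classes.
// \p{M} is included as an assumption so tone diacritics stay attached.
const WORDPUNCT = /[\p{L}\p{M}\p{N}_]+|[^\p{L}\p{M}\p{N}_\s]+/gu;

// Hypothetical helper, not part of the taibun API.
function wordpunctTokenise(text) {
  return text.match(WORDPUNCT) ?? [];
}

console.log(wordpunctTokenise('太空朋友,恁好!恁食飽未?'));
// [ '太空朋友', ',', '恁好', '!', '恁食飽未', '?' ]
```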

Keep original

keepOriginal Boolean - defines whether the original characters of the input are retained.

  • true (default) - preserve original characters
  • false - replace original characters with characters defined in the dataset

| text         | true                 | false                |
| ------------ | -------------------- | -------------------- |
| 臺灣火鸡肉饭 | ['臺灣', '火鸡肉饭'] | ['台灣', '火雞肉飯'] |

Other Functions

Handy functions for NLP tasks in Taiwanese Hokkien.

The toTraditional function converts input to the Traditional Chinese characters used in the dataset. It also accounts for different variants of Traditional Chinese characters.

The toSimplified function converts input to Simplified Chinese characters.

The isCjk function checks whether the input string consists entirely of Chinese characters.

toTraditional(input);

toSimplified(input);

isCjk(input);
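The check that isCjk describes can be approximated with a Unicode script property. This is a sketch under the assumption that "Chinese characters" means code points in the Han script; `isAllHan` is a hypothetical name, and the character ranges Taibun actually accepts may differ.

```javascript
// Hypothetical stand-in for isCjk; the library's character ranges may differ.
function isAllHan(text) {
  // Anchored on both ends, so every character must be in the Han script.
  return /^\p{Script=Han}+$/u.test(text);
}

console.log(isAllHan('我食麭'));     // true
console.log(isAllHan('我食pháng')); // false
```

In this sketch an empty string returns false, because the pattern requires at least one character.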

Example

// Converter
const { Converter } = require('taibun');

//// System
let c = new Converter(); // Tailo system default
c.get('先生講,學生恬恬聽。');
// >> Sian-sinn kóng, ha̍k-sing tiām-tiām thiann.

c = new Converter({ system: 'Zhuyin' });
c.get('先生講,學生恬恬聽。');
// >> ㄒㄧㄢ ㄒㆪ ㄍㆲˋ, ㄏㄚㆶ˙ ㄒㄧㄥ ㄉㄧㆰ˫ ㄉㄧㆰ˫ ㄊㄧㆩ.

//// Dialect
c = new Converter(); // south dialect default
c.get("我欲用箸食魚");
// >> Guá beh īng tī tsia̍h hî

c = new Converter({ dialect: 'north' });
c.get("我欲用箸食魚");
// >> Guá bueh īng tū tsia̍h hû

c = new Converter({ dialect: 'singapore' });
c.get("我欲用箸食魚");
// >> Uá bueh ēng tū tsia̍h hû

//// Format
c = new Converter(); // for Tailo, mark by default
c.get("生日快樂");
// >> Senn-ji̍t khuài-lo̍k

c = new Converter({ format: 'number' });
c.get("生日快樂");
// >> Senn1-jit8 khuai3-lok8

c = new Converter({ format: 'strip' });
c.get("生日快樂");
// >> Senn-jit khuai-lok

//// Delimiter
c = new Converter({ delimiter: '' });
c.get("先生講,學生恬恬聽。");
// >> Siansinn kóng, ha̍ksing tiāmtiām thiann.

c = new Converter({ system: 'Pingyim', delimiter: '-' });
c.get("先生講,學生恬恬聽。");
// >> Siān-snī gǒng, hág-sīng diâm-diâm tinā.

//// Sandhi
c = new Converter(); // for Tailo, sandhi none by default
c.get("這是你的茶桌仔無");
// >> Tse sī lí ê tê-toh-á bô

c = new Converter({ sandhi: 'auto' });
c.get("這是你的茶桌仔無");
// >> Tse sì li ē tē-to-á bô

c = new Converter({ sandhi: 'excLast' });
c.get("這是你的茶桌仔無");
// >> Tsē sì li ē tē-tó-a bô

c = new Converter({ sandhi: 'inclLast' });
c.get("這是你的茶桌仔無");
// >> Tsē sì li ē tē-tó-a bō

//// Punctuation
c = new Converter(); // format punctuation default
c.get("太空朋友,恁好!恁食飽未?");
// >> Thài-khong pîng-iú, lín-hó! Lín tsia̍h-pá buē?

c = new Converter({ punctuation: 'none' });
c.get("太空朋友,恁好!恁食飽未?");
// >> thài-khong pîng-iú,lín-hó!lín tsia̍h-pá buē?

//// Convert non-CJK
c = new Converter({ system: 'Zhuyin' }); // false convertNonCjk default
c.get("我食pháng");
// >> ㆣㄨㄚˋ ㄐㄧㄚㆷ˙ pháng

c = new Converter({ system: 'Zhuyin', convertNonCjk: true });
c.get("我食pháng");
// >> ㆣㄨㄚˋ ㄐㄧㄚㆷ˙ ㄆㄤˋ


// Tokeniser
const { Tokeniser } = require('taibun');

let t = new Tokeniser();
t.tokenise("太空朋友,恁好!恁食飽未?");
// >> ['太空', '朋友', ',', '恁好', '!', '恁', '食飽', '未', '?']

//// Keep Original
t = new Tokeniser(); // true keepOriginal default
t.tokenise("爲啥物臺灣遮爾好?");
// >> ['爲啥物', '臺灣', '遮爾', '好', '?']

t.tokenise("为啥物台湾遮尔好?");
// >> ['为啥物', '台湾', '遮尔', '好', '?']

t = new Tokeniser(false);
t.tokenise("爲啥物臺灣遮爾好?");
// >> ['為啥物', '台灣', '遮爾', '好', '?']

t.tokenise("为啥物台湾遮尔好?");
// >> ['為啥物', '台灣', '遮爾', '好', '?']


// Other Functions
const { toTraditional, toSimplified, isCjk } = require('taibun');

//// toTraditional
toTraditional("我听无台语");
// >> 我聽無台語

toTraditional("我爱这个个人台面");
// >> 我愛這个個人檯面

toTraditional("爲啥物");
// >> 為啥物

//// toSimplified
toSimplified("我聽無台語");
// >> 我听无台语

//// isCjk
isCjk('我食麭');
// >> true

isCjk('我食pháng');
// >> false

Data

Acknowledgements

Licence

Because Taibun is MIT-licensed, any developer can essentially do whatever they want with it as long as they include the original copyright and licence notice in any copies of the source code. Note that the data used by the package is covered by a different licence.

The data is licensed under CC BY-SA 4.0.