
zhconv

v0.3.1


🦀 Convert Trad/Simp Chinese and regional phrases among each other based on rules of MediaWiki and OpenCC 轉換中文简繁體及地區詞,基於MediaWiki和OpenCC之轉換表 (powered by Rust/WASM 驅動)


zhconv-rs 中文简繁及地區詞轉換

zhconv-rs converts Chinese text among traditional/simplified scripts or regional variants (e.g. zh-TW <-> zh-CN <-> zh-HK <-> zh-Hans <-> zh-Hant), built on top of rulesets from MediaWiki/Wikipedia and OpenCC.

The implementation is powered by the Aho-Corasick algorithm, ensuring linear time complexity with respect to the combined length of the input text and conversion rules (O(n+m)), processing dozens of MiB of text per second.
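The single-pass, leftmost-longest scan can be sketched in plain Python with a toy trie (a stand-in for the real Aho-Corasick automaton; this naive walk is only illustrative and not how the crate is implemented):

```python
# Toy sketch of single-pass, longest-match replacement over a rule table.
# The real crate uses a prebuilt daachorse automaton; names here are made up.

def build_trie(rules):
    trie = {}
    for src, dst in rules.items():
        node = trie
        for ch in src:
            node = node.setdefault(ch, {})
        node["$"] = dst  # terminal marker stores the replacement
    return trie

def convert(text, trie):
    out, i = [], 0
    while i < len(text):
        node, j, best = trie, i, None
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if "$" in node:
                best = (j, node["$"])  # remember the longest match so far
        if best:
            end, repl = best
            out.append(repl)
            i = end  # jump past the matched source word
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

trie = build_trie({"干": "幹", "天干物燥": "天乾物燥"})
print(convert("天干物燥 小心火烛", trie))  # 天乾物燥 小心火烛
```

Each input character is visited at most once per match attempt, which is what makes a merged, single-pass conversion fast compared with one scan per rule.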

🔗 Web App: https://zhconv.pages.dev (powered by WASM)

⚙️ CLI: cargo install zhconv-cli or check releases.

🦀 Rust Crate: cargo add zhconv (check docs for examples)

🐍 Python Package via PyO3: pip install zhconv-rs (WASM with wheels)

# > pip install zhconv_rs
# Convert with builtin rulesets:
from zhconv_rs import zhconv
assert zhconv("天干物燥 小心火烛", "zh-tw") == "天乾物燥 小心火燭"
assert zhconv("霧失樓臺,月迷津渡", "zh-hans") == "雾失楼台,月迷津渡"
assert zhconv("《-{zh-hans:三个火枪手;zh-hant:三劍客;zh-tw:三劍客}-》是亞歷山大·仲馬的作品。", "zh-cn", mediawiki=True) == "《三个火枪手》是亚历山大·仲马的作品。"
assert zhconv("-{H|zh-cn:雾都孤儿;zh-tw:孤雛淚;zh-hk:苦海孤雛;zh-sg:雾都孤儿;zh-mo:苦海孤雛;}-《雾都孤儿》是查尔斯·狄更斯的作品。", "zh-tw", True) == "《孤雛淚》是查爾斯·狄更斯的作品。"

# Convert with custom rules:
from zhconv_rs import make_converter
assert make_converter(None, [("天", "地"), ("水", "火")])("甘肅天水") == "甘肅地火"

import io
convert = make_converter("zh-hans", io.StringIO("䖏 处\n罨畫 掩画")) # or path to rule file
assert convert("秀州西去湖州近 幾䖏樓臺罨畫間") == "秀州西去湖州近 几处楼台掩画间"

JS (Webpack): npm install zhconv or yarn add zhconv (WASM, instructions)

JS in browser: https://cdn.jsdelivr.net/npm/zhconv-web@latest (WASM)

<script type="module">
    // Use ES module import syntax to import functionality from the module
    // that we have compiled.
    //
    // Note that the `default` import is an initialization function which
    // will "boot" the module and make it ready to use. Currently browsers
    // don't support natively imported WebAssembly as an ES module, but
    // eventually the manual initialization won't be required!
    import init, { zhconv } from 'https://cdn.jsdelivr.net/npm/zhconv-web@latest/zhconv.js'; // specify a version tag if in prod

    async function run() {
        await init();

        alert(zhconv(prompt("Text to convert to zh-hans:"), "zh-hans"));
    }

    run();
</script>

Supported variants

| Target | Tag | Script | Description |
| --- | --- | --- | --- |
| Simplified Chinese / 简体中文 | zh-Hans | SC / 简 | Without substituting region-specific phrases. |
| Traditional Chinese / 繁體中文 | zh-Hant | TC / 繁 | Without substituting region-specific phrases. |
| Chinese (Taiwan) / 臺灣正體 | zh-TW | TC / 繁 | With Taiwan-specific phrases adapted. |
| Chinese (Hong Kong) / 香港繁體 | zh-HK | TC / 繁 | With Hong Kong-specific phrases adapted. |
| Chinese (Macau) / 澳门繁體 | zh-MO | TC / 繁 | Same as zh-HK for now. |
| Chinese (Mainland China) / 大陆简体 | zh-CN | SC / 简 | With mainland China-specific phrases adapted. |
| Chinese (Singapore) / 新加坡简体 | zh-SG | SC / 简 | Same as zh-CN for now. |
| Chinese (Malaysia) / 大马简体 | zh-MY | SC / 简 | Same as zh-CN for now. |

Note: zh-TW and zh-HK are based on zh-Hant; zh-CN is based on zh-Hans. Currently, zh-MO shares the same ruleset with zh-HK unless additional rules are manually configured; zh-MY and zh-SG share the same ruleset with zh-CN unless additional rules are manually configured.
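The fallback relationships described in the note can be written out as a small lookup table (illustrative only, not part of the zhconv API):

```python
# Variant fallback chains as described above; each regional tag falls back
# to another tag or to a base script (zh-Hans / zh-Hant).
BASE = {
    "zh-TW": "zh-Hant",
    "zh-HK": "zh-Hant",
    "zh-MO": "zh-HK",   # same ruleset as zh-HK for now
    "zh-CN": "zh-Hans",
    "zh-SG": "zh-CN",   # same ruleset as zh-CN for now
    "zh-MY": "zh-CN",   # same ruleset as zh-CN for now
}

def resolve(tag):
    """Follow fallbacks until reaching a base script tag."""
    while tag in BASE:
        tag = BASE[tag]
    return tag

print(resolve("zh-MO"))  # zh-Hant
print(resolve("zh-MY"))  # zh-Hans
```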

Performance

cargo bench on an Intel(R) Xeon(R) CPU @ 2.80GHz (GitPod), as of v0.2, without parsing inline conversion rules:

load zh2Hant            time:   [45.442 ms 45.946 ms 46.459 ms]
load zh2Hans            time:   [8.1378 ms 8.3787 ms 8.6414 ms]
load zh2TW              time:   [60.209 ms 61.261 ms 62.407 ms]
load zh2HK              time:   [89.457 ms 90.847 ms 92.297 ms]
load zh2MO              time:   [96.670 ms 98.063 ms 99.586 ms]
load zh2CN              time:   [27.850 ms 28.520 ms 29.240 ms]
load zh2SG              time:   [28.175 ms 28.963 ms 29.796 ms]
load zh2MY              time:   [27.142 ms 27.635 ms 28.143 ms]
zh2TW data54k           time:   [546.10 us 553.14 us 561.24 us]
zh2CN data54k           time:   [504.34 us 511.22 us 518.59 us]
zh2Hant data689k        time:   [3.4375 ms 3.5182 ms 3.6013 ms]
zh2TW data689k          time:   [3.6062 ms 3.6784 ms 3.7545 ms]
zh2Hant data3185k       time:   [62.457 ms 64.257 ms 66.099 ms]
zh2TW data3185k         time:   [60.217 ms 61.348 ms 62.556 ms]
zh2TW data55m           time:   [1.0773 s 1.0872 s 1.0976 s]

The benchmark above was performed on an earlier version that shipped only MediaWiki rulesets; with OpenCC rulesets activated by default, performance may degrade by roughly 2x. However, since v0.3 the Aho-Corasick implementation has been switched to daachorse, with the automaton prebuilt at compile time, so performance is no worse than the previous version even though the OpenCC rulesets are newly introduced.

Note that the OpenCC rulesets account for at least several MiB of build output. If that is too large, override the default features (e.g. zhconv = { version = "...", features = [ "compress" ] }).

Differences with other converters

  • ZhConver{sion,ter}.php of MediaWiki: zhconv-rs takes the conversion tables listed in ZhConversion.php. MediaWiki relies on the inefficient PHP built-in function strtr. In its basic mode, zhconv-rs guarantees linear time complexity (T = O(n+m) instead of O(nm)) and single-pass scanning of the input text. Optionally, zhconv-rs supports the same conversion rule syntax as MediaWiki.
  • OpenCC: The conversion rulesets of OpenCC are independent of MediaWiki's. The core conversion implementation of OpenCC is somewhat similar to the aforementioned strtr. However, OpenCC supports pre-segmentation and maintains multiple rulesets that are applied successively. By contrast, the Aho-Corasick-powered zhconv-rs merges the rulesets from MediaWiki and OpenCC at compile time and converts text in a single linear-time pass, which is considerably more efficient. Conversion results may differ in some cases, though.
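A small, contrived example shows why successively applied rulesets (the strtr-like, one-pass-per-rule approach) can differ from a merged single-pass conversion: a later pass may rewrite the output of an earlier one. The two rules below are made up for illustration and are not taken from either project's actual tables:

```python
# Naive "one pass per rule" conversion: each rule scans the whole text,
# so cost grows with text length times rule count, and later rules can
# cascade onto the output of earlier rules.
def convert_multipass(text, rules):
    for src, dst in rules:
        text = text.replace(src, dst)
    return text

# Contrived rules: the second rule's source is the first rule's target.
rules = [("发", "發"), ("發", "髮")]
print(convert_multipass("发", rules))  # 髮 (cascaded through both rules)
```

A merged single-pass converter would instead stop after applying 发 -> 發, since the output is never rescanned.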

Limitations

The converter uses a leftmost-longest matching strategy: among overlapping matches, the one starting at the earliest position wins, and ties are broken in favor of the longest match. For instance, if a ruleset includes both 干 -> 幹 and 天干物燥 -> 天乾物燥, the converter applies 天干物燥 -> 天乾物燥 because that match starts at an earlier position (and is longer) than matching 干 alone. This strategy yields good results in general, but may occasionally lead to wrong conversions.

The implementation supports most of the MediaWiki conversion rules, but it is not fully compliant with the original implementation.

Besides, for wikitext, if the input text contains global conversion rules (in MediaWiki syntax, like -{H|zh-hans:鹿|zh-hant:马}-), the time complexity may degrade to O(n*m) in the worst case (equivalent to brute force), where n is the length of the text and m is the maximum length of the source words in the conversion rulesets.
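For intuition, resolving a simple inline rule like the ones shown in the Python examples above can be sketched with a regex (illustrative only; the crate's actual parser also handles H| global rules, flags, and variant fallback chains):

```python
import re

# Minimal sketch: resolve inline MediaWiki rules such as
# -{zh-hans:三个火枪手;zh-hant:三劍客;zh-tw:三劍客}- for one target variant.
INLINE = re.compile(r"-\{(.*?)\}-")

def resolve_inline(text, target):
    def pick(m):
        for part in m.group(1).split(";"):
            if ":" in part:
                variant, repl = part.split(":", 1)
                if variant.strip().lower() == target.lower():
                    return repl
        return m.group(0)  # no matching variant: leave the rule as-is
    return INLINE.sub(pick, text)

print(resolve_inline("《-{zh-hans:三个火枪手;zh-hant:三劍客;zh-tw:三劍客}-》", "zh-tw"))
# 《三劍客》
```

Global (H|) rules are harder: each one adds a new source word that every later position of the text must be checked against, which is where the worst-case O(n*m) behavior comes from.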

Credits

All rulesets that power the converter come from the MediaWiki project and OpenCC.

The project takes the following projects/pages as references:

  • https://github.com/gumblex/zhconv : Python implementation of zhConver{ter,sion}.php.
  • https://github.com/BYVoid/OpenCC/ : Widely adopted Chinese converter.
  • https://zh.wikipedia.org/wiki/Wikipedia:字詞轉換處理
  • https://zh.wikipedia.org/wiki/Help:高级字词转换语法
  • https://github.com/wikimedia/mediawiki/blob/master/includes/language/LanguageConverter.php

TODO

  • [x] Support Module:CGroup
  • [ ] Propagate errors properly with Anyhow and thiserror
  • [x] Python lib
  • [x] More examples in README
  • [x] Add rulesets from OpenCC