hardcoded-language-detector

v1.0.6

Published

13 days ago

NPM library for detecting the language of given text

Downloads

460

0High
0Medium
0Low

melodysdreamj

language detection npm library

hardcoded-language-detector

A powerful script family detection library that analyzes text using Unicode ranges to determine the ratio of different writing systems present in the text.

Features

Fast and accurate script family detection using Unicode ranges
Returns ratios of different script families in the text
Identifies the dominant script family
Supports mixed script detection
Handles special characters and numbers
No external dependencies
Lightweight (~20KB)

Installation

npm install hardcoded-language-detector

Supported Script Families

Each script family is represented by a two-letter code:

Latin (la) - Basic Latin, Extended Latin-A to E
- English, French, German, Spanish, Portuguese, Vietnamese, Turkish, etc.
- Includes diacritics and special characters used in European languages
Cyrillic (cy) - Cyrillic and Extensions
- Russian, Ukrainian, Bulgarian, Serbian, Belarusian, etc.
Arabic (ar) - Arabic and Extensions
- Arabic, Persian (Farsi), Urdu, Kurdish, Sindhi
- Includes all Arabic presentation forms and supplements
Devanagari (de) - Devanagari and Extensions
- Hindi, Marathi, Sanskrit, Nepali, etc.
Brahmic (br) - Various Brahmic family scripts
- Bengali, Tamil, Telugu, Kannada, Malayalam
- Gujarati, Gurmukhi (Punjabi), Oriya, Sinhala
Han (hz) - CJK Unified Ideographs
- Chinese (Traditional & Simplified)
- Japanese Kanji
- Korean Hanja
- Includes all CJK extensions (A through H)
Kana (kn) - Japanese syllabaries
- Hiragana
- Katakana (including half-width forms)
- Phonetic extensions
Hangul (hn) - Korean writing system
- Modern Hangul syllables
- Archaic Korean letters
- Compatibility Jamo
- Half-width forms
Thai (th) - Thai script
- Thai language characters
- Thai digits and symbols
Hebrew (he) - Hebrew script
- Modern Hebrew
- Biblical Hebrew
- Includes presentation forms
Greek (gr) - Greek and Coptic
- Modern Greek
- Ancient Greek
- Extended Greek
- Ancient Greek numbers
Unknown (un) - Unrecognized scripts or special characters
- Numbers
- Punctuation marks
- Special symbols
- Emojis
- Other Unicode characters not in above categories

Usage

const detectScriptFamily = require('hardcoded-language-detector');

// Single script
console.log(detectScriptFamily('Hello World'));
// Output: { top: 'la', la: 1 }

console.log(detectScriptFamily('안녕하세요'));
// Output: { top: 'hn', hn: 1 }

// Mixed scripts
console.log(detectScriptFamily('Hello 안녕 こんにちは'));
// Output: { top: 'la', la: 0.33, hn: 0.33, kn: 0.34 }

// Special cases
console.log(detectScriptFamily('123!@#'));
// Output: { top: 'un', un: 1 }

console.log(detectScriptFamily(''));
// Output: { top: 'un', un: 1 }

Return Value Format

The function returns an object with:

top: The dominant script family code (highest ratio)
Script family codes as keys with their ratios as values
Ratios are rounded to 2 decimal places
Only ratios >= 0.01 (1%) are included
Unknown or special characters are marked as 'un'

Error Handling

Empty strings return { top: 'un', un: 1 }
Strings with only numbers/special characters return { top: 'un', un: 1 }
Invalid input (null/undefined) returns { top: 'un', un: 1 }

License

MIT

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

Note on Code Blocks

When using this library in documentation, please be careful with code blocks containing CJK characters. Some markdown processors might have issues with Unicode characters in code blocks. Always test the documentation rendering with CJK examples.

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

hardcoded-language-detector

Features

Installation

Supported Script Families

Usage

Return Value Format

Error Handling

License

Contributing

Note on Code Blocks