hardcoded-language-detector
v1.0.6
Published
NPM library for detecting the language of given text
Downloads
460
Maintainers
Readme
hardcoded-language-detector
A powerful script family detection library that analyzes text using Unicode ranges to determine the ratio of different writing systems present in the text.
Features
- Fast and accurate script family detection using Unicode ranges
- Returns ratios of different script families in the text
- Identifies the dominant script family
- Supports mixed script detection
- Handles special characters and numbers
- No external dependencies
- Lightweight (~20KB)
Installation
npm install hardcoded-language-detector
Supported Script Families
Each script family is represented by a two-letter code:
Latin (la) - Basic Latin, Extended Latin-A to E
- English, French, German, Spanish, Portuguese, Vietnamese, Turkish, etc.
- Includes diacritics and special characters used in European languages
Cyrillic (cy) - Cyrillic and Extensions
- Russian, Ukrainian, Bulgarian, Serbian, Belarusian, etc.
Arabic (ar) - Arabic and Extensions
- Arabic, Persian (Farsi), Urdu, Kurdish, Sindhi
- Includes all Arabic presentation forms and supplements
Devanagari (de) - Devanagari and Extensions
- Hindi, Marathi, Sanskrit, Nepali, etc.
Brahmic (br) - Various Brahmic family scripts
- Bengali, Tamil, Telugu, Kannada, Malayalam
- Gujarati, Gurmukhi (Punjabi), Oriya, Sinhala
Han (hz) - CJK Unified Ideographs
- Chinese (Traditional & Simplified)
- Japanese Kanji
- Korean Hanja
- Includes all CJK extensions (A through H)
Kana (kn) - Japanese syllabaries
- Hiragana
- Katakana (including half-width forms)
- Phonetic extensions
Hangul (hn) - Korean writing system
- Modern Hangul syllables
- Archaic Korean letters
- Compatibility Jamo
- Half-width forms
Thai (th) - Thai script
- Thai language characters
- Thai digits and symbols
Hebrew (he) - Hebrew script
- Modern Hebrew
- Biblical Hebrew
- Includes presentation forms
Greek (gr) - Greek and Coptic
- Modern Greek
- Ancient Greek
- Extended Greek
- Ancient Greek numbers
Unknown (un) - Unrecognized scripts or special characters
- Numbers
- Punctuation marks
- Special symbols
- Emojis
- Other Unicode characters not in above categories
Usage
const detectScriptFamily = require('hardcoded-language-detector');
// Single script
console.log(detectScriptFamily('Hello World'));
// Output: { top: 'la', la: 1 }
console.log(detectScriptFamily('안녕하세요'));
// Output: { top: 'hn', hn: 1 }
// Mixed scripts
console.log(detectScriptFamily('Hello 안녕 こんにちは'));
// Output: { top: 'la', la: 0.33, hn: 0.33, kn: 0.34 }
// Special cases
console.log(detectScriptFamily('123!@#'));
// Output: { top: 'un', un: 1 }
console.log(detectScriptFamily(''));
// Output: { top: 'un', un: 1 }
Return Value Format
The function returns an object with:
top
: The dominant script family code (highest ratio)- Script family codes as keys with their ratios as values
- Ratios are rounded to 2 decimal places
- Only ratios >= 0.01 (1%) are included
- Unknown or special characters are marked as 'un'
Error Handling
- Empty strings return
{ top: 'un', un: 1 }
- Strings with only numbers/special characters return
{ top: 'un', un: 1 }
- Invalid input (null/undefined) returns
{ top: 'un', un: 1 }
License
MIT
Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
Note on Code Blocks
When using this library in documentation, please be careful with code blocks containing CJK characters. Some markdown processors might have issues with Unicode characters in code blocks. Always test the documentation rendering with CJK examples.