@shelf/text-normalizer
v2.0.1
Text normalizer initially done for openai/whisper but ported to TS with love by shelf.io!
text-normalizer
Originally taken from openai/whisper and rewritten in TypeScript.
A TypeScript library for normalizing English text. It provides a utility class, EnglishTextNormalizer, with methods for normalizing various types of text, such as contractions, abbreviations, and spacing.
EnglishTextNormalizer consists of other classes you can reuse independently:
- EnglishSpellingNormalizer: uses a dictionary of English words and their American spellings. The dictionary is stored in a JSON file named english.json.
- EnglishNumberNormalizer: converts numbers written as English words into actual numbers.
- BasicTextNormalizer: provides methods for removing special characters and diacritics from text, as well as splitting words into separate letters.
Install
$ yarn add @shelf/text-normalizer
Usage
Node.js
import {EnglishTextNormalizer} from '@shelf/text-normalizer';
const normalizer = new EnglishTextNormalizer();
console.log(normalizer.normalize("Let's")); // Output: let us
console.log(normalizer.normalize("he's like")); // Output: he is like
console.log(normalizer.normalize("she's been like")); // Output: she has been like
console.log(normalizer.normalize('10km')); // Output: 10 km
console.log(normalizer.normalize('10mm')); // Output: 10 mm
console.log(normalizer.normalize('RC232')); // Output: rc 232
console.log(normalizer.normalize('Mr. Park visited Assoc. Prof. Kim Jr.')); // Output: mister park visited associate professor kim junior
Browser
import {EnglishTextNormalizer} from 'https://esm.sh/@shelf/text-normalizer';
const normalizer = new EnglishTextNormalizer();
console.log(normalizer.normalize("Let's")); // Output: let us
console.log(normalizer.normalize("he's like!")); // Output: he is like
Advanced Usage
Using EnglishNumberNormalizer
import {EnglishNumberNormalizer} from '@shelf/text-normalizer';
const numberNormalizer = new EnglishNumberNormalizer();
console.log(numberNormalizer.normalize('twenty-five')); // Output: 25
console.log(numberNormalizer.normalize('three million')); // Output: 3000000
console.log(numberNormalizer.normalize('two and a half')); // Output: 2.5
console.log(numberNormalizer.normalize('fifty percent')); // Output: 50%
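To illustrate the general idea behind turning number words into digits, here is a minimal, self-contained sketch. It is not the library's implementation: the function name wordsToNumber and the lookup tables are hypothetical, and it only handles plain cardinals below a trillion (no fractions, percentages, or ordinals, which the real EnglishNumberNormalizer covers).

```typescript
// Illustrative sketch only; names and tables here are assumptions, not the library's code.
const UNITS = new Map<string, number>([
  ['zero', 0], ['one', 1], ['two', 2], ['three', 3], ['four', 4],
  ['five', 5], ['six', 6], ['seven', 7], ['eight', 8], ['nine', 9],
  ['ten', 10], ['eleven', 11], ['twelve', 12], ['thirteen', 13],
  ['fourteen', 14], ['fifteen', 15], ['sixteen', 16], ['seventeen', 17],
  ['eighteen', 18], ['nineteen', 19], ['twenty', 20], ['thirty', 30],
  ['forty', 40], ['fifty', 50], ['sixty', 60], ['seventy', 70],
  ['eighty', 80], ['ninety', 90],
]);

const SCALES = new Map<string, number>([
  ['thousand', 1e3], ['million', 1e6], ['billion', 1e9],
]);

// Returns the numeric value, or undefined when a word is not a known number word.
function wordsToNumber(text: string): number | undefined {
  // Split on whitespace and hyphens ("twenty-five") and skip the filler word "and".
  const words = text
    .toLowerCase()
    .split(/[\s-]+/)
    .filter(w => w.length > 0 && w !== 'and');
  if (words.length === 0) return undefined;

  let total = 0; // sum of completed groups (thousands, millions, ...)
  let group = 0; // value of the group currently being read (0..999)
  for (const word of words) {
    const unit = UNITS.get(word);
    const scale = SCALES.get(word);
    if (unit !== undefined) {
      group += unit;
    } else if (word === 'hundred') {
      group = (group === 0 ? 1 : group) * 100;
    } else if (scale !== undefined) {
      total += (group === 0 ? 1 : group) * scale;
      group = 0;
    } else {
      return undefined;
    }
  }
  return total + group;
}

console.log(wordsToNumber('twenty-five')); // 25
console.log(wordsToNumber('three million')); // 3000000
```

The sketch accumulates a running group and flushes it whenever a scale word appears, which is a common way to parse spoken numbers; the library may use a different strategy.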
Using EnglishSpellingNormalizer
import {EnglishSpellingNormalizer} from '@shelf/text-normalizer';
const spellingNormalizer = new EnglishSpellingNormalizer();
console.log(spellingNormalizer.normalize('colour')); // Output: color
console.log(spellingNormalizer.normalize('organise')); // Output: organize
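Dictionary-based spelling normalization can be pictured as a word-by-word lookup. The sketch below is an assumption about the general technique, not the library's code: the two-entry map stands in for the much larger english.json, and the name normalizeSpelling is hypothetical.

```typescript
// Hypothetical excerpt of a British-to-American spelling dictionary.
// A Map is used so that words like "constructor" are not confused with
// properties inherited from Object.prototype.
const spellingMap = new Map<string, string>([
  ['colour', 'color'],
  ['organise', 'organize'],
]);

// Replace each whitespace-separated word with its mapped spelling, if any.
function normalizeSpelling(text: string): string {
  return text
    .split(/\s+/)
    .filter(word => word.length > 0)
    .map(word => spellingMap.get(word) ?? word)
    .join(' ');
}

console.log(normalizeSpelling('the colour of the sky')); // the color of the sky
```

Because the lookup is exact, input is expected to be lowercased and stripped of punctuation first, which is one reason to run such a step late in a normalization pipeline.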
Using BasicTextNormalizer
import {BasicTextNormalizer} from '@shelf/text-normalizer';
const basicNormalizer = new BasicTextNormalizer(true, true);
console.log(basicNormalizer.normalize('Café!')); // Output: c a f e
console.log(basicNormalizer.normalize('Hello [World]')); // Output: h e l l o
Configuration
BasicTextNormalizer
The BasicTextNormalizer constructor accepts two optional boolean parameters:
- removeDiacritics (default: false): If set to true, diacritics will be removed from the text.
- splitLetters (default: false): If set to true, letters will be split into individual characters.
Example:
const normalizer = new BasicTextNormalizer(true, true);
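To make the effect of the two flags concrete, here is a simplified sketch of what such a normalizer might do internally. It is an approximation under stated assumptions, not the library's implementation: the function name basicNormalizeSketch is hypothetical, only ASCII punctuation is stripped, and diacritics are removed with the standard String.prototype.normalize('NFKD') idiom.

```typescript
// Simplified sketch; the real BasicTextNormalizer may differ in details.
function basicNormalizeSketch(
  text: string,
  removeDiacritics = false,
  splitLetters = false
): string {
  let s = text.toLowerCase();

  // Drop bracketed and parenthesized spans such as "[World]" or "(laughs)".
  s = s.replace(/[<\[][^>\]]*[>\]]/g, '').replace(/\([^)]+?\)/g, '');

  if (removeDiacritics) {
    // NFKD splits "é" into "e" plus a combining accent (U+0300..U+036F),
    // which is then removed.
    s = s.normalize('NFKD').replace(/[\u0300-\u036f]/g, '');
  }

  // Replace ASCII punctuation and symbols with spaces.
  s = s.replace(/[!-\/:-@\[-`{-~]/g, ' ');

  if (splitLetters) {
    // Remove whitespace, then put one space between every character.
    s = Array.from(s.replace(/\s+/g, '')).join(' ');
  }

  // Collapse runs of whitespace and trim the ends.
  return s.replace(/\s+/g, ' ').trim();
}

console.log(basicNormalizeSketch('Caf\u00e9!', true, true)); // c a f e
console.log(basicNormalizeSketch('Hello [World]', true, true)); // h e l l o
```

With both flags left at their defaults, the same sketch only lowercases, drops bracketed spans, and strips punctuation, so accented letters are preserved.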
Performance Considerations
- The EnglishTextNormalizer combines multiple normalization techniques and may be slower for very large texts. Consider using individual normalizers (EnglishNumberNormalizer, EnglishSpellingNormalizer, or BasicTextNormalizer) if you only need specific functionality.
- For repeated normalization of large amounts of text, consider initializing the normalizer once and reusing it to avoid unnecessary setup time.
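The advice to initialize once and reuse can be sketched with a lazily created, module-level instance. To keep the snippet runnable without the package installed, it uses a hypothetical stand-in class (StandInNormalizer) and a hypothetical TextNormalizer interface; in real code you would construct EnglishTextNormalizer in its place.

```typescript
// Hypothetical interface matching the normalize(text) method used in this README.
interface TextNormalizer {
  normalize(text: string): string;
}

// Counts constructions so the reuse is observable in this sketch.
let constructions = 0;

// Stand-in for a normalizer whose constructor does expensive setup
// (for example, loading a large dictionary).
class StandInNormalizer implements TextNormalizer {
  constructor() {
    constructions += 1;
  }
  normalize(text: string): string {
    return text.toLowerCase().trim();
  }
}

let cached: TextNormalizer | undefined;

// Create the normalizer on first use and return the same instance afterwards.
function getNormalizer(): TextNormalizer {
  if (cached === undefined) {
    cached = new StandInNormalizer();
  }
  return cached;
}

// Normalize many lines with a single shared instance.
function normalizeAll(lines: string[]): string[] {
  const normalizer = getNormalizer();
  return lines.map(line => normalizer.normalize(line));
}

console.log(normalizeAll(['  Hello ', 'WORLD'])); // [ 'hello', 'world' ]
console.log(constructions); // 1
```

The key point is that the constructor runs outside the per-line loop; whether you hold the instance in a module variable, a class field, or a dependency-injection container is a matter of taste.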
Related Projects
- compromise - Natural language processing in JavaScript
Publish
$ git checkout master
$ yarn version
$ yarn publish
$ git push origin master --tags
License
MIT © Shelf