mnl-ws-norm
v1.0.3
Lightweight tool for normalizing whitespace, splitting lines, and accurately tokenizing words (no regex). Multiple natural languages supported. Useful for scraping, machine learning, and data analysis.
Installation
npm install mnl-ws-norm
function isWhiteSpace(char)
Returns true if char is a whitespace character.
char must be passed as a string with a length of 1.
import {isWhiteSpace} from 'mnl-ws-norm';
console.log("Half-width space is white space: " + isWhiteSpace(" "));
console.log("Tab is white space: " + isWhiteSpace("\t"));
console.log("'A' is white space: " + isWhiteSpace("A"));
console.log("'\\n' is white space: " + isWhiteSpace("\n"));
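For readers curious how a regex-free check of this kind can work, here is a minimal sketch. It is not the package's source: the name sketchIsWhiteSpace, the decision to throw on invalid input, and the use of the Unicode White_Space code points as the lookup set are all assumptions made for illustration, and the package's actual character set may differ.

```javascript
// Hypothetical sketch, NOT the mnl-ws-norm implementation.
// Assumption: the whitespace set is the Unicode White_Space property.
const WHITE_SPACE_CODE_POINTS = new Set([
  0x0009, 0x000a, 0x000b, 0x000c, 0x000d, 0x0020, 0x0085, 0x00a0, 0x1680,
  0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005, 0x2006, 0x2007, 0x2008,
  0x2009, 0x200a, 0x2028, 0x2029, 0x202f, 0x205f, 0x3000,
]);

function sketchIsWhiteSpace(char) {
  // Mirrors the documented contract: a string with a length of 1.
  // Throwing here is a choice made for this sketch only.
  if (typeof char !== "string" || char.length !== 1) {
    throw new TypeError("char must be a string with a length of 1");
  }
  // A set lookup on the code unit replaces any regular expression.
  return WHITE_SPACE_CODE_POINTS.has(char.charCodeAt(0));
}

console.log(sketchIsWhiteSpace(" "));      // true (half-width space)
console.log(sketchIsWhiteSpace("\u3000")); // true (full-width space)
console.log(sketchIsWhiteSpace("A"));      // false
```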
function isLineBreak(char)
Returns true if char is a line break.
char must be passed as a string with a length of 1.
import {isLineBreak} from 'mnl-ws-norm';
console.log("'\\n' is line break: " + isLineBreak("\n"));
console.log("Half-width space is line break: " + isLineBreak(" "));
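A line-break check can follow the same lookup idea with a smaller set. The sketch below is illustrative only: sketchIsLineBreak is a hypothetical name, and the set (LF, VT, FF, CR, NEL, LINE SEPARATOR, PARAGRAPH SEPARATOR) is an assumption, not a statement of which characters the package treats as line breaks.

```javascript
// Hypothetical sketch, NOT the mnl-ws-norm implementation.
// Assumption: these seven code points count as line breaks.
const LINE_BREAK_CODE_POINTS = new Set([
  0x000a, // line feed (\n)
  0x000b, // vertical tab
  0x000c, // form feed
  0x000d, // carriage return (\r)
  0x0085, // next line
  0x2028, // line separator
  0x2029, // paragraph separator
]);

function sketchIsLineBreak(char) {
  if (typeof char !== "string" || char.length !== 1) {
    throw new TypeError("char must be a string with a length of 1");
  }
  return LINE_BREAK_CODE_POINTS.has(char.charCodeAt(0));
}

console.log(sketchIsLineBreak("\n")); // true
console.log(sketchIsLineBreak("\r")); // true
console.log(sketchIsLineBreak(" "));  // false
```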
function splitBySpaces(inputStr)
inputStr is the string from which words are to be tokenized.
inputStr must be passed as a string.
Note: This function splits inputStr by all whitespace characters (spaces, line breaks, etc.).
import {splitBySpaces} from 'mnl-ws-norm';
//Source string 1 with half-width spaces (Unicode: U+0020) and a tab (Unicode: U+0009).
var sourceStr1 = "Hey, everybody,\thow are you doing?";
//Source string 2 with half-width spaces and a \n character (Unicode: U+000A).
var sourceStr2 = "Hey, everybody\nhow are you doing?";
//Source string 3 with half-width spaces and a full-width space (Unicode: U+3000).
var sourceStr3 = "Hey, everybody,\u3000how are you doing?";
//The join method is used in this example to separate the elements in the returned array.
console.log("sourceStr1: " + splitBySpaces(sourceStr1).join("-"));
console.log("sourceStr2: " + splitBySpaces(sourceStr2).join("-"));
console.log("sourceStr3: " + splitBySpaces(sourceStr3).join("-"));
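Splitting on every whitespace character without a regex can be done in a single pass over the string. The following is a rough sketch under stated assumptions, not the package's code: sketchSplitBySpaces is a hypothetical name, the whitespace set is the Unicode White_Space property, and runs of consecutive whitespace are assumed to produce no empty tokens.

```javascript
// Hypothetical sketch, NOT the mnl-ws-norm implementation.
// Assumption: whitespace means the Unicode White_Space code points.
const SPLIT_WHITE_SPACE = new Set([
  0x0009, 0x000a, 0x000b, 0x000c, 0x000d, 0x0020, 0x0085, 0x00a0, 0x1680,
  0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005, 0x2006, 0x2007, 0x2008,
  0x2009, 0x200a, 0x2028, 0x2029, 0x202f, 0x205f, 0x3000,
]);

function sketchSplitBySpaces(inputStr) {
  const words = [];
  let start = -1; // index where the current word began, or -1 if between words
  for (let i = 0; i < inputStr.length; i++) {
    if (SPLIT_WHITE_SPACE.has(inputStr.charCodeAt(i))) {
      if (start !== -1) {
        words.push(inputStr.slice(start, i)); // close the current word
        start = -1;
      }
    } else if (start === -1) {
      start = i; // first character of a new word
    }
  }
  if (start !== -1) {
    words.push(inputStr.slice(start)); // trailing word with no space after it
  }
  return words;
}

// Mixed half-width space, tab, and full-width space (U+3000).
console.log(sketchSplitBySpaces("Hey, everybody,\thow are\u3000you doing?").join("-"));
// prints "Hey,-everybody,-how-are-you-doing?"
```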
function splitByLines(inputStr, removeExtraSpaces = false)
Required parameter -> inputStr
inputStr is the string from which lines are to be tokenized.
inputStr must be passed as a string.
Optional parameter -> removeExtraSpaces
By default, leading/trailing spaces are not removed from lines. Specifying removeExtraSpaces as true removes leading/trailing spaces.
import {splitBySpaces, splitByLines} from 'mnl-ws-norm';
var sourceStr = "Hey, everybody.\nHow are you doing?\rI am alright.";
var lines = splitByLines(sourceStr);
//The join method is used in this example to separate the elements in the returned array.
for (var i = 0; i < lines.length; i++) {
    console.log("Line " + i + " : " + splitBySpaces(lines[i]).join("-"));
}
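As an illustration of how line splitting with optional trimming might be written without a regex, here is a small sketch. It is not taken from the package: sketchSplitByLines is a hypothetical name, only \n and \r are treated as line breaks, a \r\n pair is assumed to count as a single break, and empty lines are assumed to be kept. The package may behave differently on each of these points.

```javascript
// Hypothetical sketch, NOT the mnl-ws-norm implementation.
// Assumptions: only \n and \r break lines, \r\n is one break,
// and empty lines are kept in the result.
function sketchSplitByLines(inputStr, removeExtraSpaces = false) {
  const lines = [];
  let current = "";
  for (let i = 0; i < inputStr.length; i++) {
    const ch = inputStr[i];
    if (ch === "\n" || ch === "\r") {
      // Skip the \n of a \r\n pair so it is not counted twice.
      if (ch === "\r" && inputStr[i + 1] === "\n") {
        i++;
      }
      lines.push(removeExtraSpaces ? current.trim() : current);
      current = "";
    } else {
      current += ch;
    }
  }
  // The text after the final line break (or the whole string if none).
  lines.push(removeExtraSpaces ? current.trim() : current);
  return lines;
}

console.log(sketchSplitByLines("Hey, everybody.\nHow are you doing?\rI am alright."));
// prints [ 'Hey, everybody.', 'How are you doing?', 'I am alright.' ]
```

Passing true as the second argument trims each line, which is how the sketch models the documented removeExtraSpaces option.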
function normSpaces(inputStr, spaceType, removeExtraSpaces = false)
Required parameters -> inputStr, spaceType
inputStr is the string in which the whitespace characters are to be replaced.
inputStr must be passed as a string.
spaceType is the string used to replace all whitespace characters in inputStr.
spaceType must be passed as a string.
Optional parameter -> removeExtraSpaces
By default, extra whitespace characters are not removed from inputStr.
Specifying removeExtraSpaces as true removes extra whitespace characters from inputStr.
Note: Regardless of the value of removeExtraSpaces, the returned string may have leading/trailing whitespace characters, so you may want to use the trim() method as necessary.
import {normSpaces} from 'mnl-ws-norm';
//Source string with consecutive half-width spaces (Unicode: U+0020) and a tab (Unicode: U+0009).
var sourceStr = "  Hey, everybody,\thow are  you doing?  ";
//Each whitespace character in sourceStr is replaced with a half-width space; extra spaces are kept.
console.log(normSpaces(sourceStr, " "));
//Spaces in sourceStr are replaced with a half-width space, and extra spaces are removed.
console.log(normSpaces(sourceStr, " ", true));
//Spaces in sourceStr are replaced with a full-width space (Unicode: U+3000), and extra spaces are removed.
console.log(normSpaces(sourceStr, "\u3000", true));
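To make the effect of removeExtraSpaces concrete, the sketch below models one plausible reading of the description above. It is an assumption-laden illustration, not the package's code: sketchNormSpaces is a hypothetical name, the whitespace set is the Unicode White_Space property, and "extra" whitespace is taken to mean every character after the first in a consecutive run. Consistent with the note above, a leading or trailing run still leaves one spaceType in place, so trim() is applied separately.

```javascript
// Hypothetical sketch, NOT the mnl-ws-norm implementation.
// Assumption: whitespace means the Unicode White_Space code points.
const NORM_WHITE_SPACE = new Set([
  0x0009, 0x000a, 0x000b, 0x000c, 0x000d, 0x0020, 0x0085, 0x00a0, 0x1680,
  0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005, 0x2006, 0x2007, 0x2008,
  0x2009, 0x200a, 0x2028, 0x2029, 0x202f, 0x205f, 0x3000,
]);

function sketchNormSpaces(inputStr, spaceType, removeExtraSpaces = false) {
  let result = "";
  let previousWasSpace = false;
  for (let i = 0; i < inputStr.length; i++) {
    const isSpace = NORM_WHITE_SPACE.has(inputStr.charCodeAt(i));
    if (!isSpace) {
      result += inputStr[i];
    } else if (!removeExtraSpaces || !previousWasSpace) {
      // Either every whitespace character is replaced one-for-one,
      // or only the first character of each run is replaced.
      result += spaceType;
    }
    previousWasSpace = isSpace;
  }
  return result;
}

const sample = "  Hey,\teverybody  ";
// JSON.stringify makes the leading/trailing spaces visible in the output.
console.log(JSON.stringify(sketchNormSpaces(sample, " ")));       // "  Hey, everybody  "
console.log(JSON.stringify(sketchNormSpaces(sample, " ", true))); // " Hey, everybody "
console.log(JSON.stringify(sketchNormSpaces(sample, " ", true).trim())); // "Hey, everybody"
```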
Other languages
- Python -> https://github.com/Rairye/mnl-ws-norm