sirrobert-tokenize
v1.0.2
A utility module for tokenizing a string.
Rationale
A module for splitting a string into tokens according to a fixed set of patterns.
Namespace rationale
This module exists in the sirrobert- namespace so as not to clutter npm and to keep my related packages together. If there's enough interest, I can move this into the general namespace.
Installation
Local installation
npm install --save sirrobert-tokenize
Global installation
npm install --global sirrobert-tokenize
Token Definitions
This module contains only one function, called "tokenize". It takes a string and returns an array of tokens.
Tokens are defined by the module as one of:
Quoted Strings. Quoted strings include single- or double-quoted strings. Double quoted strings are defined as
/("[^"\\]*(?:\\.[^"\\]*)*")/
and single-quoted strings are defined similarly as
/('[^'\\]*(?:\\.[^'\\]*)*')/
Note that this means that escaped quotes inside a quoted string are permitted, so these would be valid tokens: "\"" and '\''
Consecutive Non-Whitespace. Any sequence of non-whitespace characters bounded by whitespace or a string boundary. That means the following are examples of valid tokens of this kind: pickle, potato-pants, 249nf9W$GH(WGOSWJEUR. The definition of this kind of token is:
/\S+/
Consecutive Whitespace. Any consecutive whitespace. The specific definition of this kind of token is:
/\s+/
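The three definitions above can be combined into a single pattern. The following is a minimal sketch, not the module's actual source: the names TOKEN and rawTokens are hypothetical, and the ordering of the alternatives (quoted strings first) is an assumption chosen so that a quoted string is matched before the generic non-whitespace pattern can claim its opening quote.

```javascript
// Hypothetical sketch, not the module's source: the documented patterns
// combined into one global regular expression.
const DOUBLE_QUOTED = /"[^"\\]*(?:\\.[^"\\]*)*"/;
const SINGLE_QUOTED = /'[^'\\]*(?:\\.[^'\\]*)*'/;
const NON_WHITESPACE = /\S+/;
const WHITESPACE = /\s+/;

// Quoted strings come first so they win over the generic \S+ alternative.
const TOKEN = new RegExp(
  [DOUBLE_QUOTED, SINGLE_QUOTED, NON_WHITESPACE, WHITESPACE]
    .map((re) => re.source)
    .join("|"),
  "g"
);

// Returns every token in input order, whitespace runs included.
function rawTokens(input) {
  return input.match(TOKEN) || [];
}

// String.raw keeps the backslashes, so the quoted string contains \" escapes.
console.log(rawTokens(String.raw`say "a \"b\" c" now`));
```

In this sketch the escaped quotes do not end the quoted string, so the whole of "a \"b\" c" comes back as a single token, with the whitespace runs on either side returned as their own tokens.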
Usage
This module provides one function that takes a string and gives an array of tokens as defined above.
The usage pattern goes: tokenize(string, [options-hash])
let tokenize = require("sirrobert-tokenize");
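// Inside a double-quoted JavaScript literal, \' is just ', so the inner
// quotes around Walrus are not escaped in the actual string.
let str = "I am the \"Egg Man\" and the 'the \'Walrus\''";
tokenize(str);
/* [ 'I',
 *   'am',
 *   'the',
 *   '"Egg Man"',
 *   'and',
 *   'the',
 *   '\'the \'',
 *   'Walrus\'\'' ]
 */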
Options
There are two options available for the options hash:
whitespace
How to handle whitespace in the string. Available values are:
ignore
Disregard all whitespace. This is the default.
append
Keep all whitespace. Append it to the token it comes immediately after. Whitespace at the beginning of the string is its own token.
prepend
Keep all whitespace. Prepend it to the token it comes immediately before. Whitespace at the end of the string is its own token.
tokenize
Keep all whitespace. Each set of whitespace is its own token.
trimInput
Whether to trim whitespace from the input string before processing. Defaults to true. Any value is evaluated as a boolean.
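The defaults described above can be sketched as a small normalization step. This is an illustration only: the helper name normalizeOptions and the RangeError on an unrecognized whitespace value are assumptions, not documented behavior of the module.

```javascript
// Hypothetical helper, not part of the module's API.
const WHITESPACE_MODES = ["ignore", "append", "prepend", "tokenize"];

function normalizeOptions(options = {}) {
  // whitespace defaults to 'ignore' when it is not supplied.
  const whitespace =
    options.whitespace === undefined ? "ignore" : options.whitespace;
  if (!WHITESPACE_MODES.includes(whitespace)) {
    // Assumption: reject unknown modes rather than silently ignoring them.
    throw new RangeError(`unknown whitespace mode: ${whitespace}`);
  }
  // trimInput defaults to true; any supplied value is coerced to a boolean.
  const trimInput =
    options.trimInput === undefined ? true : Boolean(options.trimInput);
  return { whitespace, trimInput };
}

console.log(normalizeOptions());
console.log(normalizeOptions({ whitespace: "append", trimInput: 0 }));
```

With no arguments the sketch yields whitespace 'ignore' and trimInput true; a falsy value such as 0 for trimInput is coerced to false.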
Here are examples of various combinations of options using the string above, with one leading and one trailing space added so that the effect of trimInput is visible.
{ whitespace: 'ignore', trimInput: true }
[ 'I',
'am',
'the',
'"Egg Man"',
'and',
'the',
'\'the \'',
'Walrus\'\'' ]
{ whitespace: 'ignore', trimInput: false }
[ 'I',
'am',
'the',
'"Egg Man"',
'and',
'the',
'\'the \'',
'Walrus\'\'' ]
{ whitespace: 'append', trimInput: true }
[ 'I ',
'am ',
'the ',
'"Egg Man" ',
'and ',
'the ',
'\'the \'',
'Walrus\'\'' ]
{ whitespace: 'append', trimInput: false }
[ ' ',
'I ',
'am ',
'the ',
'"Egg Man" ',
'and ',
'the ',
'\'the \'',
'Walrus\'\' ' ]
{ whitespace: 'prepend', trimInput: true }
[ 'I',
' am',
' the',
' "Egg Man"',
' and',
' the',
' \'the \'',
'Walrus\'\'' ]
{ whitespace: 'prepend', trimInput: false }
[ ' I',
' am',
' the',
' "Egg Man"',
' and',
' the',
' \'the \'',
'Walrus\'\'',
' ' ]
{ whitespace: 'tokenize', trimInput: true }
[ 'I',
' ',
'am',
' ',
'the',
' ',
'"Egg Man"',
' ',
'and',
' ',
'the',
' ',
'\'the \'',
'Walrus\'\'' ]
{ whitespace: 'tokenize', trimInput: false }
[ ' ',
'I',
' ',
'am',
' ',
'the',
' ',
'"Egg Man"',
' ',
'and',
' ',
'the',
' ',
'\'the \'',
'Walrus\'\'',
' ' ]
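The four whitespace modes can be sketched in a few lines. This is a hedged illustration rather than the module's implementation: sketchTokenize, TOKEN_PATTERN, and isWhitespace are hypothetical names, and the sketch assumes the example string was padded with one leading and one trailing space, as the trimInput: false outputs above suggest.

```javascript
// Hypothetical sketch of the whitespace modes; not the module's source.
const TOKEN_PATTERN =
  /"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*'|\S+|\s+/g;
const isWhitespace = (token) => /^\s+$/.test(token);

function sketchTokenize(input, { whitespace = "ignore", trimInput = true } = {}) {
  const source = trimInput ? input.trim() : input;
  const raw = source.match(TOKEN_PATTERN) || [];

  switch (whitespace) {
    case "tokenize": // every whitespace run is its own token
      return raw;
    case "ignore": // drop whitespace runs entirely
      return raw.filter((token) => !isWhitespace(token));
    case "append": {
      const out = [];
      for (const token of raw) {
        // Glue whitespace onto the token before it; leading whitespace
        // has no predecessor and stays a token of its own.
        if (isWhitespace(token) && out.length > 0) out[out.length - 1] += token;
        else out.push(token);
      }
      return out;
    }
    case "prepend": {
      const out = [];
      let pending = "";
      for (const token of raw) {
        // Hold whitespace until the next token arrives, then glue it on.
        if (isWhitespace(token)) pending = token;
        else {
          out.push(pending + token);
          pending = "";
        }
      }
      // Trailing whitespace has no successor and stays a token of its own.
      if (pending) out.push(pending);
      return out;
    }
    default:
      throw new RangeError(`unknown whitespace mode: ${whitespace}`);
  }
}

const padded = ` I am the "Egg Man" and the 'the 'Walrus'' `;
console.log(sketchTokenize(padded, { whitespace: "append", trimInput: false }));
console.log(sketchTokenize(padded, { whitespace: "prepend", trimInput: false }));
```

For the padded example string, this sketch produces the same arrays as the listings above, which is a useful cross-check on the option descriptions even though the module itself may be written differently.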
LICENSE
MIT