@coder-seb/tokenizer
v1.0.10
Lexical tokenizer with grammar classes ready to go.
Tokenizer
This lexical tokenizer ships with ready-to-use grammar objects.
Grammar objects
- WordAndDot - Tokens: [WORD, DOT].
- Arithmetic - Tokens: [NUMBER, ADD, SUB, DIV, MUL, EQUAL, OPENING, CLOSING].
- MaximalMunch - Tokens: [INTEGER, FLOAT].
- Exclamation - Tokens: [EXCLAMATION].
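The MaximalMunch grammar name refers to the "longest match wins" rule. The following is a minimal sketch of that idea only, not the package's implementation: the `rules` table, its regular expressions, and the `munch` function are assumptions made for illustration.

```javascript
// Hypothetical rule table. The type names mirror the MaximalMunch token
// types, but the patterns are assumptions made for this sketch.
const rules = [
  { type: 'INTEGER', regex: /^[0-9]+/ },
  { type: 'FLOAT', regex: /^[0-9]+\.[0-9]+/ }
]

// Try every rule at the start of the input and keep the longest match.
function munch (input) {
  const text = input.trimStart()
  let best = null
  for (const { type, regex } of rules) {
    const match = regex.exec(text)
    if (match !== null && (best === null || match[0].length > best.value.length)) {
      best = { type, value: match[0] }
    }
  }
  return best // null when no rule matches
}

console.log(munch('3.14')) // { type: 'FLOAT', value: '3.14' }
console.log(munch('6 3.14')) // { type: 'INTEGER', value: '6' }
```

Both rules match the leading `3` of `'3.14'`, but the FLOAT match is longer, so it wins. This agrees with the behaviour reported for TC22 and TC23 in the test output.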
Instructions
Installing
npm install @coder-seb/tokenizer
Creating a new Tokenizer object
import { Tokenizer, WordAndDot } from '@coder-seb/tokenizer'
const textGrammar = new WordAndDot()
const textTokenizer = new Tokenizer(textGrammar, 'This is a string.')
Get all tokens
const textGrammar = new WordAndDot()
const textTokenizer = new Tokenizer(textGrammar, 'This is a string.')
while (textTokenizer.hasNextToken()) {
  textTokenizer.setNextToken()
}
console.log(textTokenizer.getTokens())
Output:
[
{ Token: 'WORD', Regex: /^[\w|åäöÅÄÖ]+/, Value: 'This' },
{ Token: 'WORD', Regex: /^[\w|åäöÅÄÖ]+/, Value: 'is' },
{ Token: 'WORD', Regex: /^[\w|åäöÅÄÖ]+/, Value: 'a' },
{ Token: 'WORD', Regex: /^[\w|åäöÅÄÖ]+/, Value: 'string' },
{ Token: 'DOT', Regex: /^\./, Value: '.' },
{ Token: 'END', Regex: 'END', Value: 'END' }
]
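Because each token is a plain object with `Token`, `Regex` and `Value` fields, the array can be processed with ordinary array methods. The sketch below uses a literal copy of the output above rather than calling the package, and `valuesOfType` is a hypothetical helper, not part of the library.

```javascript
// Literal copy of the documented getTokens() output, so the sketch runs
// without the package installed.
const tokens = [
  { Token: 'WORD', Regex: /^[\w|åäöÅÄÖ]+/, Value: 'This' },
  { Token: 'WORD', Regex: /^[\w|åäöÅÄÖ]+/, Value: 'is' },
  { Token: 'WORD', Regex: /^[\w|åäöÅÄÖ]+/, Value: 'a' },
  { Token: 'WORD', Regex: /^[\w|åäöÅÄÖ]+/, Value: 'string' },
  { Token: 'DOT', Regex: /^\./, Value: '.' },
  { Token: 'END', Regex: 'END', Value: 'END' }
]

// Hypothetical helper: collect the values of all tokens of one type.
function valuesOfType (tokenList, type) {
  return tokenList
    .filter(token => token.Token === type)
    .map(token => token.Value)
}

console.log(valuesOfType(tokens, 'WORD')) // [ 'This', 'is', 'a', 'string' ]
```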
Get active token (sequence [])
console.log(textTokenizer.getActiveToken())
Output:
{ Token: 'WORD', Regex: /^[\w|åäöÅÄÖ]+/, Value: 'This' }
Get next token (sequence [>])
textTokenizer.setNextToken()
console.log(textTokenizer.getActiveToken())
Output:
{ Token: 'WORD', Regex: /^[\w|åäöÅÄÖ]+/, Value: 'is' }
Get a token several steps ahead (sequence [>>>])
import { Tokenizer, WordAndDot } from '@coder-seb/tokenizer'
const textGrammar = new WordAndDot()
const textTokenizer = new Tokenizer(textGrammar, 'This is a string.')
console.log(textTokenizer.getActiveToken())
textTokenizer.setNextToken(3)
console.log(textTokenizer.getActiveToken())
Output:
{ Token: 'WORD', Regex: /^[\w|åäöÅÄÖ]+/, Value: 'This' }
{ Token: 'WORD', Regex: /^[\w|åäöÅÄÖ]+/, Value: 'string' }
Get previous token (sequence [<])
import { Tokenizer, WordAndDot } from '@coder-seb/tokenizer'
const textGrammar = new WordAndDot()
const textTokenizer = new Tokenizer(textGrammar, 'This is a string.')
textTokenizer.setNextToken(2)
console.log(textTokenizer.getActiveToken())
textTokenizer.setPreviousToken()
console.log(textTokenizer.getActiveToken())
Output:
{ Token: 'WORD', Regex: /^[\w|åäöÅÄÖ]+/, Value: 'a' }
{ Token: 'WORD', Regex: /^[\w|åäöÅÄÖ]+/, Value: 'is' }
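The sequence notation used in the headings above (`>` for one step forward, `<` for one step back) can be turned into a small helper. This is a sketch under assumptions: `applySequence` and `StubTokenizer` are hypothetical and not part of the package. The stub only imitates the three methods shown in this README so that the example runs on its own.

```javascript
// Apply a sequence such as '>><' to any object exposing setNextToken(),
// setPreviousToken() and getActiveToken(), then return the active token.
function applySequence (tokenizer, sequence) {
  for (const step of sequence) {
    if (step === '>') tokenizer.setNextToken()
    else if (step === '<') tokenizer.setPreviousToken()
    else throw new Error(`Unknown step '${step}'`)
  }
  return tokenizer.getActiveToken()
}

// Hypothetical stand-in for a Tokenizer, backed by a fixed token array.
class StubTokenizer {
  constructor (tokens) {
    this.tokens = tokens
    this.index = 0
  }

  setNextToken () { this.index += 1 }

  setPreviousToken () {
    if (this.index === 0) throw new RangeError('No previous token')
    this.index -= 1
  }

  getActiveToken () { return this.tokens[this.index] }
}

const stub = new StubTokenizer([
  { Token: 'WORD', Value: 'a' },
  { Token: 'DOT', Value: '.' },
  { Token: 'WORD', Value: 'b' },
  { Token: 'END', Value: 'END' }
])

console.log(applySequence(stub, '>><').Token) // prints DOT
```

With a real `Tokenizer` instance in place of the stub, the same helper should work as long as the object exposes those three methods.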
Test cases
In the test names, each > in a sequence is one setNextToken() call, each < is one setPreviousToken() call, and [] means no calls.
PASS test/Tokenizer.test.js
Tokenizer tests
Text grammar
√ TC1 input 'a' sequence [] is of token type WORD and value is 'a' (1 ms)
√ TC2 input 'a aa' sequence [>] is of token type WORD and value is 'aa'
√ TC3 input 'a.b' sequence [>] is of token type DOT and value is '.'
√ TC4 input 'a.b' sequence [>>] is of token type WORD and value is 'b'
√ TC5 input 'aa. b' sequence [>>] is of token type WORD and value is 'b'
√ TC6 input 'a .b' sequence [>><] is of token type DOT and value is '.'
√ TC7 input '' sequence [] is of token type END and value is 'END' (1 ms)
√ TC8 input ' ' sequence [] is of token type END and value is 'END'
√ TC9 input 'a' sequence [>] is of token type END and value is 'END' (1 ms)
√ TC10 input 'a' sequence [<] throws IndexException (13 ms)
√ TC11 input '!' sequence [] throws InvalidTokenException
Arithmetic grammar
√ TC12 input '3' sequence [] is of token type NUMBER and value is '3'
√ TC13 input '3.14' sequence [] is of token type NUMBER and value is '3.14'
√ TC14 input '3 + 54 * 4' sequence [>>>] is of token type MUL and value is '*' (1 ms)
√ TC15 input '3+5 # 4' sequence [>>>] throws InvalidTokenException
√ TC16 input '3.0+54.1 + 4.2' sequence [><>>>] is of token type ADD and value is '+'
√ TC17 input '-' sequence [] is of token type SUB and value is '-'
√ TC18 input '/' sequence [] is of token type DIV and value is '/'
√ TC19 input ')(' sequence [>] is of token type OPENING and value is '('
√ TC20 input '3(5-2)' sequence [>>>>>] is of token type CLOSING and value is ')'
√ TC21 input '3 = 4 - 1' sequence [>] is of token type EQUAL and value is '='
Maximal munch grammar
√ TC22 input '3.14' sequence [] is of token type FLOAT and value is '3.14' (1 ms)
√ TC23 input '6 3.14' sequence [] is of token type INTEGER and value is '6'
Exclamation grammar
√ TC24 input ' ! ' sequence [] is of token type EXCLAMATION and value is '!'
Additional coverage tests
√ TC25 input 'Hello World.' after sequence [>>] getTokens() returns array:
[
{ Token: 'WORD', Regex: /^[\w|åäöÅÄÖ]+/, Value: 'Hello' },
{ Token: 'WORD', Regex: /^[\w|åäöÅÄÖ]+/, Value: 'World' },
{ Token: 'DOT', Regex: /^\./, Value: '.' },
{ Token: 'END', Regex: 'END', Value: 'END' }
]
√ TC26 input 'Hello World.' after sequence [>>] getTokenLength() returns 4 including the END token.
√ TC27 input 'Hello World.' after sequence [>>] hasNextToken() returns true as next token should be an END token.
√ TC28 input 'Hello World.' after sequence [>>>] hasNextToken() returns false.
Test Suites: 1 passed, 1 total
Tests: 28 passed, 28 total
Snapshots: 0 total
Time: 0.749 s, estimated 1 s
Ran all test suites.