qb-json-next
v2.1.3
Published
fast (~300 MB/sec) and simple incremental JSON parsing that can start and stop at any byte offset
Downloads
48
Maintainers
Readme
qb-json-next
A very fast, very light, zero-dependency, validating JSON tokenizer (~300 MB/sec running node 6 on 2.2 GHz Intel i7).
qb-json-next provides core parsing and state management for converting raw JSON buffers into tokens in a tiny library.
Check out qb-json-tokenizer for a wrapper over this tokenization that makes it easier to parse arbitrary-size streaming chunks of JSON.
Complies with the 100% test coverage and minimum dependency requirements of qb-standard .
install
npm install qb-json-next
next (ps, opt)
next() is the function returned by require('qb-json-next'). It updates a ps or "parse-state" object. The ps object can be any object with a 'next_src' buffer.
ps - the "parse state" object which is updated with every call to next() (see following section)
opt {
err - a function that handles errors. if not specified, errors will be thrown instead.
}
for example, we can parse and print all tokens in a buffer like this:
const next = require('qb-json-next')
const ps = { next_src: new Buffer( '{ "a": [1,2,3] }' ) }
while ( next(ps) ) {
console.log( next.tokstr(ps) ) // see documentation for tokstr() details
}
output:
> {@0 // start of object at 0
> k3@2:[@7 // key (3 bytes) at 2, start of array at 7
> d1@8 // decimal (1 byte) at 8
> d1@10 // decimal (1 byte) at 10
> d1@12 // decimal (1 byte) at 12
> ]@13 // end array at 13
> }@15 // end object at 15
next() returns the last token parsed, the ps.tok property, which is zero if a complete token was not parsable. ps.tok and other properties are also explained in the documentation below.
next() has other helpful properties such as TOK, POS, ECODE, posname() and tokstr() which are explained in the following section.
The parse-state object (ps)
Each call to next(ps) updates the parse-state object or 'ps'. A parse-state object can be any object you like and can be as simple as:
ps = { next_src: a_source_buffer }
Upon calling next(ps), the parse state will be enriched with a complete collection of parse state information:
{
// user-populated:
next_src // [byt] - users must set this to the src buffer/chunk of JSON to process before calling next().
Calling next() will move this buffer to the 'src' field for processing, at which point user
may set next_src to the next chunk to be processed.
// qb-next managed fields:
src // [byt] - set to the current buffer processed
soff // int - the prior src offset. e.g. ps.soff + ps.vlim = total byte offset from start
vcount // int - value count - number of complete values or key-values read so far
koff // int - key offset
klim // int - key limit
tok // int - token - integer indicating what was parsed or encountered, see chart
voff // int - value offset
vlim // int - value limit
stack // [byt] - ascii open braces/brackets representing array or object containers and depth
pos // int - relative parse position code (before value, after key...)
ecode // int - end-code used to indicate special termination state such as truncated or illegal values
}
Parsing across buffers with whole values
The ps.next_src property is checked before returning from next() whenever ps.src is empty or finished. If set, ps.next_src is moved to ps.src, limits are reset, and parsing continues with the new buffer. This allows parsing across buffers:
const next = require('qb-json-next')
const ps = {}
while (ps.next_src = get_next_buffer_somehow()) {
while (next(ps)) {
console.log(next.tokstr(ps))
}
}
Parsing stopping at any point retains exact information about the stop state, but next() can only continue parsing after a whole value or key-value pair.
(You can use qb-json-align to handle arbitrary buffer splits)
next() cannot continue after part of a key, part of a value or after a key with no subsequent value (array and object start brackets count as a whole value).
src1 src2 split allowed?
'[' '11, 12, 13]' YES
'[11, 12' ', 13]' YES
'[11, 12, 13' ']' YES
'{ ' '}' YES
'{ "a": true ' '}' YES
'{ "a": true' ', "b": false }' YES
'{ "a": true, ' '"b": false }' YES
'{ "a": true, "b": [' '82] }' YES - key "b" is in same buffer as array start
'[ 11, 1' '2, 13 ]' NO - split number 12
'{ "a' '": true, "b": [82] }' NO - split key "a"
'{ "a": true, "b": ' '[82] }' NO - key "b" is in different buffer as array start [
Parsing buffers with any arbitrary split
The module qb-json-align supports continued parsing from any split buffer including truncated keys and values.
It's simple to use - just call align(ps) after setting ps.next_src and before calling next(ps) and split values are handled for you:
while (ps.next_src = get_next_buffer_somehow()) {
align(ps) // add this call after next_src is set to handle any split.
while (next(ps)) {
console.log(next.tokstr(ps))
}
}
ps.vcount
The number of completed values. Array and object start brackets are not counted, while array and object end brackets are counted. Strings are only counted after vlim passes the closing quote and numbers are only counted after a non-number delimiter byte is read. Keys are not counted, only the completed value following the key is counted counting every key-value pair as "1" value.
offsets: ps.koff, ps.klim, ps.voff, and ps.vlim
The offsets describe key and value start and end points. ps.koff and ps.voff are inclusive offsets where the key and value start (starting at zero), while ps.klim and ps.vlim are exclusive limits which are one byte after the last key or value byte.
When ps.koff = ps.klim, then there is no key. Object values always have a key, while array values never do.
When ps.voff = ps.vlim, then there is no value.
Remember that src data is UTF-8 encouded, not UTF-16 encoded like javascript strings so non-asci will not directly translate to strings. qb-utf8-to-str-tiny is a light-weight library (though not fast) for converting UTF-8 to javascript strings. Also note that string offsets include the quotes (ascii 34)
For example:
(ascii as strings)
koff klim voff vlim key value stack
|----------------------------------------- 0 0 0 0
|
| |--------------------------------------- 0 0 1 2 { {
| |
| | |------------------------------- 9 10 10 10 " {
| | |
| | | |----------------------------- 9 12 12 12 "a" {
| | | |
| | | | |--------------------------- 9 12 14 14 "a" {
| | | | |
| | | | | |------------------------- 9 12 15 16 "a" " {
| | | | | |
| | | | | | |---------------------- 9 12 15 19 "a" "hi" {
| | | | | | |
| | | | | | | (key/value flushed out - to callback)
| | | | | | | 12 12 19 19 {
| | | | | | |
| | | | | | | |---------------- 21 24 25 25 "b" {
| | | | | | | |
| | | | | | | | |-------------- 21 24 26 27 "b" [ {[
| | | | | | | | |
| | | | | | | | | |--------- 24 24 32 32 {[
| | | | | | | | | |
| | | | | | | | | | |------- 24 24 32 33 ] {
| | | | | | | | | | |
| | | | | | | | | | | |----- 24 24 35 36 ]
| | | | | | | | | | | |
1 2 3
01234567890123456789012345678901234567
{ "a": "hi", "b": [ 1, 2 ] }
ps.tok
The ps.tok property holds an ascii byte representing the type of value last read, or zero if parsing ended (at src limit or on error). Possible values for tok are defined in the next.TOK object property:
next.TOK
The TOK object maps names to the ps.tok integers returned by next(). For all but string and decimal, the token returned is simply the same as the first byte encountered. 't' for true, '{' for object start, etc. Here they are just as they are defined in the code:
const TOK = {
ARR: 91, // '[' - array start
ARR_END: 93, // ']' - array end
DEC: 100, // 'd' - a decimal value starting with: -, 0, 1, ..., 9
FAL: 102, // 'f' - false
NUL: 110, // 'n' - null
STR: 115, // 's' - a string value starting with "
TRU: 116, // 't' - true
OBJ: 123, // '{' - object start
OBJ_END: 125, // '}' - object end
}
next.tokstr(ps, detail)
Given a parse state, next.tokstr() returns an abbreviated string representation of the type of token, the length of key and value and the byte offset into ps.src.
Setting detail option to true will include the abbreviated position name (see POS), and a string representation of the stack. For example:
const next = require('qb-json-next')
console.log('First object:')
const ps = {next_src: new Buffer('{ "a": [1,2,3] }')}
while (next(ps)) {
console.log(next.tokstr(ps))
}
console.log('Second object with detail = true:')
ps.next_src = new Buffer(', { "b": [4,5,6] }')
while (next(ps)) {
console.log(next.tokstr(ps, true))
}
outputs:
First object:
{@0
k3@2:[@7
d1@8
d1@10
d1@12
]@13
}@15
Second object with detail = true:
{@2:O_BF:{
k3@4:[@9:A_BF:{[
d1@10:A_AV:{[
d1@12:A_AV:{[
d1@14:A_AV:{[
]@15:O_AV:{
}@17:A_AV
ps.stack and ps.pos
ps.stack and ps.pos keep track of the nested context and relative position of parsing.
ps.stack uses an array of integers (object and array open-braces ascii) to track depth and context
91 // ascii for '[' - array start
123 // ascii for '{' - object start
stack: [] | [123] | [123,91] | [123] | []
| | | |
{ name : "Samuel" , tags : [ "Sam", "Sammy" ] } , "another value"
ps.pos describes parser position relative to a JSON key or value. Note that both the start and end brackets of arrays and objects are considered values when describing position.
value value
| key value key value value |
| | | | | value value | |
| | | | | | | | |
{ name : "Samuel", tags : [ "Sam", "Sammy" ] }
There are 2 possible positions before, and after that define parse position relative to a key or value plus a first indicator to indicate if it is the first item in a new context:
before-first-value stack = ''
|
| before-first-key stack = '{'
| |
| | after-key
| | |
| | | before-value
| | | |
| | | |
| | | |
| | | | after-value
| | | | |
| | | | | before-key
| | | | | |
| | | | | |
| | | | | |
| | | | | | after-key
| | | | | | |
| | | | | | | before-value
| | | | | | | |
| | | | | | | | before-first-value stack = '{['
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | | after-value
| | | | | | | | | |
| | | | | | | | | |before-value
| | | | | | | | | ||
| | | | | | | | | ||
| | | | | | | | | ||
| | | | | | | | | || after-value
| | | | | | | | | || |
| | | | | | | | | || | after-value stack = '{'
| | | | | | | | | || | | after-value stack = ''
| | | | | | | | | || | | |
| | | | | | | | | || | | | before-value
| | | | | | | | | || | | | |
| | | | | | | | | || | | | |
| | | | | | | | | || | | | |
| | | | | | | | | || | | | | after-value
| | | | | | | | | || | | | | |
{ name : "Samuel" , tags : [ "Sam", "Sammy" ] } , "another value"
Most state management in next() is the matter of a bitwise-or and one or two array lookups per token, which is why next() can validate so quickly.
next.pos positions are defined in the next.POS object:
next.POS
The POS object maps names to the ps.pos values used by next().
ps.pos integers allow super-fast performance, but are not user-friendly, so these POS names
give some level readability (though admittedly still very brief). Here they are as
defined in the code:
// values for ps.pos(ition). LSB (0x7F) are reserved for token ascii value.
const POS = {
A_BF: 0x080, // in array, before first value
A_BV: 0x100, // in array, before value
A_AV: 0x180, // in array, after value
O_BF: 0x200, // in object, before first key
O_BK: 0x280, // in object, before key
O_AK: 0x300, // in object, after key
O_BV: 0x380, // in object, before value
O_AV: 0x400, // in object, after value
}
(Design note - the object/array context is redundant with stack information, but is included in ps.pos to allow one-step lookup and validation which is a factor in making next() validation so fast.)
next.posname (pos)
Given a ps.pos integer as input next.posname() returns a brief string code (A_BF, A_BV, O_BF...)- the same names used for next.POS
ps.ecode
The ps.ecode or "end-code" describes special parsing end states. The four possible end states are defined by the next.ECODE object:
next.ECODE
The ECODE object maps names of special parsing "end" states to integers used by ps.ecode. When one of these special states occurs, ps.tok is set to zero and ps.ecode is set to one of the following:
const ECODE = {
BAD_VALUE: 66, // 'B' encountered invalid byte or series of bytes
TRUNC_DEC: 68, // 'D' end of buffer was value was a decimal ending with a digit (0-9). it is *possibly* unfinished
KEY_NO_VAL: 75, // 'K' object key complete, but value did not start
TRUNCATED: 84, // 'T' key or value was unfinished at end of buffer
UNEXPECTED: 85, // 'U' encountered a recognized token in wrong place/context
}
How it works - Understanding the parse graph (state and stack)
The parse graph is a simple mapping of parse states to other allowed states. For next(), we employ an integer array to do this in the most efficent manner possible. We encode position state (ps.pos) along with ascii values in the same low integer to create the graph.
let state1 = states[state0 + ascii-value]
If the state isn't allowed, then state1 is zero. If allowed, it is the new ps.pos value - the next position.
let state2 = states[state1 + ascii-value]
This simple mechanism works for all state transitions, except when we leave context of an object or array.
When a '}' or ']' is encountered, the new state will have no context set (you can see this for yourself in
the Adding Custom Rules to Parsing section, below).
When closing an object or array, the 'stack' is used to supplement missing context (91 is ascii for array-close):
if (stack.length !== 0) { state1 |= (stack[stack.length - 1] === 91 ? in_arr : in_obj) }
Adding Custom Rules to Parsing
Though qb-json-next uses bit manipulation, I have tried to make the rules as readable as possible so even if you aren't comfortable with bit twiddling, you may understand and modify the parse rules. Can you see how to make parsing tolerant of trailing commas by looking at the states below? (the answer is at the bottom of this section).
First, the setup. We create an integer-to-integer mapping of all the allowed states (see generate-maps.js for this script that generates the ascii state array). The full parse graph is defined with 10 lines of configuration:
// position(s) + token(s) -> new position (state + token -> new state)
map([POS.A_BF, POS.A_BV], 'ntfds', POS.A_AV)
map([POS.A_AV], ',', POS.A_BV)
map([POS.A_BF, POS.A_BV, POS.O_BV], '[', POS.A_BF)
map([POS.A_BF, POS.A_BV, POS.O_BV], '{', POS.O_BF)
map([POS.A_BF, POS.A_AV], ']', POS.A_AV) // use any non-zero value here - stack is used to check new position
map([POS.O_BF, POS.O_AV], '}', POS.A_AV) // use any non-zero value here - stack is used to check new position
map([POS.O_AV], ',', POS.O_BK)
map([POS.O_BF, POS.O_BK], 's', POS.O_AK) // s = string
map([POS.O_AK], ':', POS.O_BV)
map([POS.O_BV], 'ntfds', POS.O_AV)
That's pretty dense, and the codes look cryptic, but it is easy to see the 'big picture' once you understand the abbreviations.
map([POS.A_BF, POS.A_BV], 'ntfds', POS.A_AV)
... maps positions
"in-array, before-first-value" and "in-array, before-value"
for tokens (null, true, false, decimal, string)
to position ("in-array, after-value")
And this statement:
map([POS.O_AV], ',', POS.O_BK)
... maps position
"in-object, after-value"
for token ',' (comma)
to position "in-object, before-key"
If that made sense, I encourage looking at the map code - it is just as understandable as that...
To make the graph tolerate trailing commas in arrays [1,2,3,], add an array-end rule where a value is expected (before-value):
map([POS.A_BV], ']', POS.A_AV )
In fact, if you look above, this looks extremely similar to the existing 'A_BF' + ']' rule that allows arrays to close without any content at all:
map([POS.A_BF, POS.A_AV], ']', POS.A_AV)
... and we could have just added our case to that rule instead, if we liked
map([POS.A_BF, POS.A_AV, POS.A_BV], ']', POS.A_AV)
To make the graph also tolerate trailing commas in an empty array [,], add an array-comma rule where a first value is expected (before-first-value):
map([POS.A_BF], ',', POS.A_BV )