rdf-nx-parser
v1.0.3
Published
Non-validating tokenizer / parser for the RDF N-Triples and N-Quads serializations (or any “N-x”)
Downloads
518
Readme
rdf-nx-parser
A non-validating tokenizer and parser for the RDF N-Triples and N-Quads serializations (or any “N-x”).
Provides parsing of N-Triples and N-Quads from strings, or tokenizing any “N-x” string.
Why?
There are enough parsers already that are faster (see last section), but having a parser for Node.js is useful for building smaller tools.
Usage
npm install --save rdf-nx-parser
The module exports a parser object:
var parser = require('rdf-nx-parser');
Parsing
Use parseTriple()
to parse an N-Triples statement, parseQuads()
for N-Quads. Both return an objects, or null
if the input can't be parsed.
var quad = parser.parseQuad(
'_:foo ' +
'<http://example.com/bar> ' +
'"\\u9B3C\\u8ECA"@jp ' +
'<http://example.com/baz> .'
);
console.log(JSON.stringify(quad, null, 4));
{
"subject": {
"type": "blankNode",
"value": "foo"
},
"predicate": {
"type": "iri",
"value": "http://example.com/bar"
},
"object": {
"type": "literal",
"value": "鬼車",
"language": "jp"
},
"graphLabel": {
"type": "iri",
"value": "http://example.com/baz"
}
}
Literal objects can have an additional language
or datatypeIri
property.
The parser does not verify that the data adheres to the [grammar] 1. It will instead happily parse anything as good as it can:
> parser.parseQuad('<foo> <:///baz> "bar" <$!#]&> .');
{ subject: { type: 'iri', value: 'foo' },
predicate: { type: 'iri', value: ':///baz' },
object: { type: 'literal', value: 'bar' },
graphLabel: { type: 'iri', value: '$!#]&' } }
You can optionally pass an options object to these methods as a second parameter, shown with the defaults here:
parser.parseTriple(input, {
// Set to `true` to get unparsed strings as `value`
//properties
asString: false,
// Include the unparsed token as `valueRaw` property
// when returning objects
includeRaw: false,
// Decode unicode escapes, `\uxxxx` and `Uxxxxxxxx`
// (but not percent encoding or punycode)
unescapeUnicode: true
});
Parsing a whole file of N-Triples / N-Quads lines can easily be done e. g. with Node's readline
module, see the example.
Tokenization
An arbitrary number of “N-x” tokens can be extracted from a string into an array of token objects with the tokenize()
method:
> parser.tokenize(
'<foo> _:bar . "123"^^<http://example.com/int> ' +
'"\u0068\u0065\u006C\u006C\u006F"@en-US . .'
);
[ { type: 'iri', value: 'foo' },
{ type: 'blankNode', value: 'bar' },
{ type: 'endOfStatement', value: '.' },
{ type: 'literal',
value: '123',
datatypeIri: 'http://example.com/int' },
{ type: 'literal',
value: 'hello',
language: 'en-US' },
{ type: 'endOfStatement', value: '.' },
{ type: 'endOfStatement', value: '.' } ]
Each token has at least a type
and a value
property. There are four token types: iri
, literal
, blankNode
and endOfStatement
(can be listed with the getTokenTypes()
method).
Implementation
The implementation is based on regular expressions (to split the input into tokens) – they are pretty fast on V8. This regex-based implementation is faster than a previous simple state machine (that read the input in one scan). Seems like regexes can be compiled more effectively into machine code.
Node.js version support
Works with Node.js 0.10 and higher.
Tests
Run with: npm test
(mocha, Chai, Istanbul)
Similar projects
- Raptor library, C
- nxparser, Java