kgx - knowledge graph toolkit
Helpful tools for (RDF/Linked Data) Knowledge Graph Exchange and Exploration
status: pretty dynamic, still changing the API when I feel like it
Motivation
Sometimes I work with RDF data. I couldn't find any tools that did all the things I wanted, or generally behaved in a way I found comfortable. So I built this as a place to put the things I kept needing.
Biggest things are probably:
Synchronous. Yes, async is great, especially with async/await, but you're going to have some of the data in memory, and when you do, things are simpler. Let's build the API around that, and then have a module for synchronizing that in-memory data with the remote data. (Maybe with async iterators one could now make async code look as good. I might try that some day.)
Converting to/from JavaScript types. I don't always want to work with graphs, and especially with NamedNodes, Literals, etc, so the API tends to convert freely between "native" representation and RDF representations.
Organize the API around quadstores, aka graphstores, aka databases, aka datasets, aka knowledge bases. We call it a "kb" in the code. You make a kb, you add stuff to it, you look at what's in it, you change stuff, you delete stuff, you mirror it to a server somewhere, etc. (It's a "kb" not a "kg" because it can contain many distinct knowledge graphs and their metadata.)
We use TriG/SPARQL-like strings in the API. Most API calls are not performance sensitive, and using a nice RDF syntax is much easier than putting together some complex JavaScript expression.
Example
From example tbl.js
const kgx = require('kgx')
const kb = kgx.memKB()

async function main () {
  const tbl = kb.named('https://www.w3.org/People/Berners-Lee/card#i')
  await kb.fetch(tbl)
  console.log('Got %d triples', [...kb].length)
  // => Got 87 triples
  // maybe: for (const q of kb) console.log(kb.quadAsNQ(q))
  for (const {title, name} of
       kb.query('?tbl foaf:title ?title; foaf:name ?name',
                { bind: {'?tbl': tbl} })) {
    console.log(title, name)
    // => Sir Timothy Berners-Lee
  }
}

main().catch(console.error)
API documentation
Only parts of the API are currently documented, sorry.
Thoughts / Plans / Notes
From here on out is just a place where I write down ideas, maybe when I've built something at a higher level and am thinking about whether I can make a general version to go in kgx.
kgx-server sources...
- runs web server to show those sources
kgx-view sources...
- runs a private kgx server and opens the result (perhaps via opn)
- or loads it into the current instance if there is one? at a standard port??
kgx-from-{csv|nt|turtle|jsonld|}
- web centric, not just parsers
- include ldfetch, all-your-base, headless-chrome-crawler, metascraper
- include progress and error reporting
- include some of the HTML stuff, maybe
- so, every load of a URL results in at least a Fetch (which might have failed)
FETCHID :fetched NG
gets put into DGNG :origin <https://google.com:5151>
NG :source <https://google.com:5151/foo/bar/baz>
kgx-to-{...}
- as currently in quadsite
- shape, format, dateformat, linkformat
library:
- new kgx.KB()
- kb.tablify(shape) returns a kgx.Table() with .rows and .headings (see the sketch after this list)
- kb.filter(f) -> read-only kb
- kb.load(src), kb.addSource(src), kb.loader.addSource(src) USE CRAWLER
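A minimal sketch of how the planned tablify might look in use; the shape string, heading values, and row representation are guesses, not the real API:

// Hypothetical usage of the planned kb.tablify(); shape syntax and
// headings are illustrative guesses, rows assumed to be arrays.
const table = kb.tablify('?person foaf:name ?name; foaf:mbox ?mbox')
console.log(table.headings)          // e.g. [ 'name', 'mbox' ]
for (const row of table.rows) {
  console.log(row.join('\t'))        // one line per matching ?person
}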
kgx-crawler
- separate process from kgx library, kgx server
- maybe just reads/writes to local fs
- https://www.digitalocean.com/community/tutorials/how-to-install-and-secure-redis-on-debian-9
- https://www.npmjs.com/package/headless-chrome-crawler
- https://news.ycombinator.com/item?id=16437082
- https://github.com/brendonboshell/supercrawler + puppeteer
- https://www.browserless.io/ --- remote puppeteer
Following / Crawling (planned)
kb.crawl({owlImports: true, predicates: true, classes: true})
kb.crawl(['some url', 'some other url'])
How is provenance recorded?
- only fetch triples, and graph name is source
- only fetch triples, and graph name is linked to source
- okay to fetch quads, but .isolate them, then link to source
kb.isolate() returns a modified kb (or modifies in place? Or just operates on a quadlist?) where any NamedNode graph names have been replaced by new BlankNodes, and the default graph is placed into a named graph, whose label (another new BlankNode) is returned. isolate() allows multiple datasets to co-exist in one dataset without interacting until/unless we query across graphs.
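To make that concrete, here is a minimal sketch of the quadlist flavor, assuming RDF/JS-style terms and a data factory; none of these names are the real kgx API:

// Sketch only: isolate() over a plain array of RDF/JS quads.
// 'rdf' is assumed to be an RDF/JS data factory (e.g. @rdfjs/data-model).
function isolate (quads, rdf) {
  const renamed = new Map()              // old graph label -> fresh BlankNode
  const defaultGraphLabel = rdf.blankNode()
  const rename = term => {
    if (term.termType === 'DefaultGraph') return defaultGraphLabel
    if (!renamed.has(term.value)) renamed.set(term.value, rdf.blankNode())
    return renamed.get(term.value)
  }
  const out = quads.map(q =>
    rdf.quad(q.subject, q.predicate, q.object, rename(q.graph)))
  // A fuller version would also rename graph labels that appear in
  // subject/object position, as in the TriG examples below.
  return { quads: out, defaultGraphLabel }
}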
Linking to source is done like:
_:gr332 { <a> <b> _:gr332_1 }
_:gr332_1 { <a> <b> <d> }
:fetch332
  providedDefaultGraph _:gr332;
  providedGraph _:gr332, _:gr332_1;
  completed $time;
  date $time;
  lastModified $time;
  fromURL $url
  .
vs
$url { <a> <b> _:gr332_1 }
_:gr332_1 { <a> <b> <d> }
# optional:
:fetch332
  providedDefaultGraph $url;
  providedGraph $url, _:gr332_1;
  completed $time;
  date $time;
  lastModified $time;
  fromURL $url
  .
Maybe it's an option:
- defaultGraphName: 'default' | 'source' | 'blank' |
But 'blank' (with isolate()) is the only one that can't get out of control, so it seems like best practice. But it also seems kinda complicated.
It means we kinda want:
kb.crawler.sources = [ { url, lastStarted, lastEnded, **defaultGraphNode** } ]
so you can find the defaultGraphNode. Right? You could also find it via querying. Basically, Crawler maintains a KB where it owns the default graph, keeping it full of metadata about fetches, and all the named graphs are what they are.
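For example, looking up a source's isolated default graph might go like this (sketch only; the sources array is the shape proposed above, and match() is assumed to behave like the RDF/JS Dataset method):

// Hypothetical: find where a crawled source's default-graph triples landed.
const src = kb.crawler.sources.find(s => s.url === 'https://example.org/data')
for (const q of kb.match(null, null, null, src.defaultGraphNode)) {
  console.log(kb.quadAsNQ(q))        // quadAsNQ as in the earlier example
}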
On-Demand (Lazy) Data (planned)
kb.provide(pattern, providerFunction)
Adds the pair to the set of active providers. The providers are consulted whenever looking in the kb for data. Can be used to implement overlayKb (unionKB? mergeKB?), and various otherwise-expensive tricks.
Unclear if we want:
- provideBindings (Solutions), nice if the pattern has lots of constants in it and maybe some joins. Basically backward chaining. Could make answering some kinds of queries super efficient; you never actually need to turn things into quads.
- provideQuads, simpler in the simple cases, especially like looking for all quads.
- provideTriples, even simpler, and lets the system offer provenance pointing to this providerFunction.
Maybe that's settled by an options parameter.
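As a sketch of the quad flavor, registering a provider might look like this; the callback shape, the documentIndex data, and the quadFromNative helper are all invented for illustration:

// Hypothetical provideQuads-style provider: lazily derives quads from
// plain JS data instead of materializing them up front.
const documentIndex = [                     // app-level data, for illustration
  { url: 'https://example.org/doc1', title: 'First Doc' }
]
kb.provide('?doc dc:title ?title', function * () {
  for (const doc of documentIndex) {
    yield kb.quadFromNative(doc.url, 'dc:title', doc.title)   // invented helper
  }
})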
Rules (planned)
Example File ruleset1.js
const ruleset = [
  {
    if: '?person foaf:firstName ?first; foaf:lastName ?last',
    do: v => { v.name = v.first + ' ' + v.last },
    then: '?person foaf:name ?name'
  }
]
ruleset.name = 'Name Vocabulary Conversion'
ruleset.strategy = 'Forward'
module.exports = ruleset
Use like:
kb.addRules(require('./ruleset1'))
Variations:
- if/then, all variables/bnodes match, pure datalog
- if/then with fresh blank nodes in the then-clause, such as due to a [ ] or ( ) construct; this is now Horn logic (which is Turing complete); see the sketch after this list. Not obvious how to implement backward-chaining with this without FOL-style "Terms". Maybe we make arrays (lists) native, and use them?
- iff/then, implies the same rule with clauses swapped
- if/do, just executable, forward only
- if/do/then, a way to execute builtins to define vars for then (but best to make them side-effect free and using no data except what's in the argument, which is why they are in a separate module in the example)
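For the fresh-blank-node variation mentioned above, a rule might look like this (ex: is an illustrative prefix; the [ ] in the then-clause mints a new blank node per match):

const hornRule = {
  if: '?person foaf:knows ?other',
  then: '?person ex:hasConnection [ ex:to ?other ]'   // fresh bnode per match
}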
Provenance
Provenance chain can use graph label, at least when only triples are concluded. What happens when you want a rule about provenance, though? Terms/tuples seem much better for this than quads. Like, instead of graph literals, just use lists of triples, where triples are spo lists. But those are harder to search when we don't know the provenance.
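For instance, a concluded graph as nested lists might look like this (sketch only; the term spelling is illustrative):

// a graph as plain data: a list of triples, each triple an s-p-o list
const concluded = [
  ['<a>', '<b>', '<c>'],
  ['<c>', '<d>', '"hello"']
]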
Do .isolate, .provide, and the Crawler stuff help with this? Do it just like that: always output isolated stuff, and link it with the provenance.
(If someone equates the graph labels, we'll get lost, though.)
_:gr007 { <a> <b> _:gr007_1 }
_:gr007_1 { <a> <b> <d> }
:fetch007
  providedDefaultGraph _:gr007;
  providedGraph _:gr007, _:gr007_1;
  completed $time;
  date $time;
  lastModified $time;
  fromRule $ruleID # this is the only different part
  .
We're going to need views to have near-JS-level performance much of the time, via provide, I think. kb.provide inputs JS objects, kb.view outputs them, and we need to make sure there usually isn't a combinatoric explosion of joins in the middle.
owl:InverseFunctionalProperty
As a special case, this reasoning can be done like:
const kb2 = kgx.owlifp.rewrite(kb1, prefns)
It's equivalent to running the IFP rule and the equate rules, but
- doesn't use the rules engine or anything sophisticated
- doesn't chain; so it's only really appropriate for Datatype Properties, where chaining isn't needed
- picks one of the values and discards the rest (you can give the preferred namespace to keep)
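For example, assuming ex:ssn is declared an owl:InverseFunctionalProperty and http://a.example/ is the preferred namespace, the rewrite might turn the first group of triples into the second (illustrative data, not real output):

# before: two nodes share an ex:ssn value, so IFP says they denote one individual
<http://a.example/p1> ex:ssn "123-45-6789" .
<http://b.example/p2> ex:ssn "123-45-6789" .
<http://b.example/p2> foaf:name "Pat" .

# after: merged onto the node in the preferred namespace
<http://a.example/p1> ex:ssn "123-45-6789" .
<http://a.example/p1> foaf:name "Pat" .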
Issue: should we use keys instead of IFP? I don't understand the DL-Safe issue in https://www.w3.org/TR/owl2-syntax/#Keys
See:
- https://www.w3.org/TR/owl2-syntax/#Inverse-Functional_Object_Properties
- https://www.w3.org/TR/owl2-mapping-to-rdf/
This is used for implementing movable schemas.