kgx

v0.2.0

Helpful tools for (RDF/Linked Data) Knowledge Graph Exchange and Exploration

kgx - knowledge graph toolkit


status: pretty dynamic, still changing the API when I feel like it

Motivation

Sometimes I work with RDF data. I couldn't find any tools that did all the things I wanted, or generally behaved in a way I found comfortable. So I built this as a place to put the things I kept needing.

Biggest things are probably:

  1. Synchronous. Yes, async is great, especially with async/await, but you're going to have some of the data in memory, and when you do, things are simpler. Let's build the API around that, and then have a module for synchronizing that in-memory data with the remote data. (Maybe with async iterators one could now make async code look as good. I might try that some day.)

  2. Converting to/from JavaScript types. I don't always want to work with graphs, and especially with NamedNodes, Literals, etc, so the API tends to convert freely between "native" representation and RDF representations.

  3. Organize the API around quadstores, aka graphstores, aka databases, aka datasets, aka knowledge bases. We call it a "kb" in the code. You make a kb, you add stuff to it, you look at what's in it, you change stuff, you delete stuff, you mirror it to a server somewhere, etc. (It's a "kb" not a "kg" because it can contain many distinct knowledge graphs and their metadata.)

  4. We use TriG/SPARQL-like strings in the API. Most API calls are not performance sensitive, and using a nice RDF syntax is much easier than putting together some complex JavaScript expression.
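Point 2 can be illustrated with a standalone sketch of the kind of conversion implied. The helper names toNative/fromNative and the RDF/JS-style term objects are assumptions for illustration; kgx's real conversion functions may be named and shaped differently:

```javascript
// Hypothetical toNative/fromNative helpers illustrating native <-> RDF
// conversion over RDF/JS-style term objects. Not kgx's actual API.
const XSD = 'http://www.w3.org/2001/XMLSchema#'

function toNative (term) {
  if (term.termType === 'Literal') {
    const dt = term.datatype && term.datatype.value
    if (dt === XSD + 'integer') return parseInt(term.value, 10)
    if (dt === XSD + 'decimal' || dt === XSD + 'double') return parseFloat(term.value)
    if (dt === XSD + 'boolean') return term.value === 'true'
    return term.value                      // strings and unknown datatypes
  }
  return term                              // NamedNodes, BlankNodes pass through
}

function fromNative (value) {
  const lit = (v, t) => ({ termType: 'Literal', value: String(v),
                           datatype: { termType: 'NamedNode', value: XSD + t } })
  if (typeof value === 'number') return lit(value, Number.isInteger(value) ? 'integer' : 'double')
  if (typeof value === 'boolean') return lit(value, 'boolean')
  if (typeof value === 'string') return lit(value, 'string')
  return value                             // already a term
}
```

With helpers like these the rest of the API can accept and return plain numbers, booleans, and strings, only exposing term objects when you ask for them.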

Example

From example tbl.js

const kgx = require('kgx')
const kb = kgx.memKB()

async function main () {
  const tbl = kb.named('https://www.w3.org/People/Berners-Lee/card#i')
  await kb.fetch(tbl)
  console.log('Got %d triples', [...kb].length)
  // => Got 87 triples

  // maybe: for (const q of kb) console.log(kb.quadAsNQ(q))

  for (const { title, name } of
       kb.query('?tbl foaf:title ?title; foaf:name ?name',
                { bind: { '?tbl': tbl } })) {
    console.log(title, name)
    // => Sir Timothy Berners-Lee
  }
}

main().catch(console.error)

API documentation

Only parts of the API are currently documented, sorry.

See API Documentation


Thoughts / Plans / Notes

From here on out is just a place where I write down ideas, usually after I've built something at a higher level and am thinking about whether I can make a general version to go in kgx.


kgx-server sources...

  • runs a web server to show those sources

kgx-view sources...

  • runs a private kgx server and opens the result in the browser
  • or loads it into the current instance if there is one? at a standard port??

kgx-from-{csv|nt|turtle|jsonld|}

  • web centric, not just parsers
  • include ldfetch, all-your-base, headless-chrome-crawler, metascraper
  • include progress and error reporting
  • include some of the HTML stuff, maybe
  • so, every load of a URL results in at least a Fetch (which might have failed)
  • FETCHID :fetched NG gets put into DG
  • NG :origin <https://google.com:5151>
  • NG :source <https://google.com:5151/foo/bar/baz>

kgx-to-{...}

  • as currently in quadsite
    • shape, format, dateformat, linkformat

library:

  • new kgx.KB()
  • kb.tablify(shape) returns a kgx.Table() .rows, .headings
  • kb.filter(f) -> read-only kb
  • kb.load(src), kb.addSource(src), kb.loader.addSource(src) USE CRAWLER
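As a rough illustration of the tablify idea: the Table shape with .rows and .headings comes from the bullet above, but the function body, the "solutions" input, and everything else here are assumptions:

```javascript
// Guess at what kb.tablify(shape) might return: a Table with .headings
// and .rows. Here "solutions" (plain binding objects) stand in for query
// results; all names are assumptions, not kgx's real API.
function tablify (solutions, headings) {
  return {
    headings,
    rows: solutions.map(s => headings.map(h => s[h]))
  }
}

const t = tablify(
  [{ title: 'Sir', name: 'Tim' }, { title: 'Countess', name: 'Ada' }],
  ['title', 'name'])
// t.rows => [['Sir', 'Tim'], ['Countess', 'Ada']]
```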

kgx-crawler

  • separate process from kgx library, kgx server
    • maybe just reads/writes to local fs
    • https://www.digitalocean.com/community/tutorials/how-to-install-and-secure-redis-on-debian-9
    • https://www.npmjs.com/package/headless-chrome-crawler
    • https://news.ycombinator.com/item?id=16437082
    • https://github.com/brendonboshell/supercrawler + puppeteer
    • https://www.browserless.io/ --- remote puppeteer

Following / Crawling (planned)

kb.crawl({owlImports: true, predicates: true, classes: true})
kb.crawl(['some url', 'some other url'])

How is provenance recorded?

  1. only fetch triples, and graph name is source
  2. only fetch triples, and graph name is linked to source
  3. okay to fetch quads, but .isolate them, then link to source

kb.isolate() returns a modified kb (or modifies in place? Or just operates on a quad list?) where any NamedNode graph names have been replaced by new BlankNodes, and the default graph is placed into a named graph, whose label (another new BlankNode) is returned. isolate() allows multiple datasets to co-exist in one dataset without interacting until/unless we query across graphs.
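A minimal sketch of isolate() operating on a plain quad list. Assumptions: a quad is { subject, predicate, object, graph } with graph === null meaning the default graph; this is not kgx's actual implementation:

```javascript
// Sketch of isolate(): every graph label (including the default graph)
// gets a fresh blank node, so merged datasets can't collide on graph
// names. The default graph's new label is returned for provenance links.
function isolate (quads) {
  let n = 0
  const fresh = new Map()                  // old graph label -> new blank node
  const rename = g => {
    if (!fresh.has(g)) fresh.set(g, '_:g' + n++)
    return fresh.get(g)
  }
  const defaultGraphLabel = rename(null)   // name the default graph first
  return {
    defaultGraphLabel,
    quads: quads.map(q => ({ ...q, graph: rename(q.graph) }))
  }
}
```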

Linking to source is done like:

_:gr332 { <a> <b> _:gr332_1 }
_:gr332_1 { <a> <b> <d> }
:fetch332
    providedDefaultGraph _:gr332;
    providedGraph _:gr332, _:gr332_1;
    completed $time;
    date $time;
    lastModified $time;
    fromURL $url
    .

vs

$url { <a> <b> _:gr332_1 }
_:gr332_1 { <a> <b> <d> }
# optional:
:fetch332
    providedDefaultGraph _:gr332;
    providedGraph _:gr332, _:gr332_1;
    completed $time;
    date $time;
    lastModified $time;
    fromURL $url
    .

Maybe it's an option:

  • defaultGraphName: 'default' | 'source' | 'blank' |

But 'blank' (with isolate()) is the only one that can't get out of control, so it seems like best practice. But it also seems kinda complicated.

It means we kinda want:

kb.crawler.sources = [ { url, lastStarted, lastEnded, **defaultGraphNode** } ]

so you can find the defaultGraphNode. Right? You could also find it via querying. Basically, Crawler maintains a KB where it owns the default graph, keeping it full of metadata about fetches, and all the named graphs are what they are.

On-Demand (Lazy) Data (planned)

kb.provide(pattern, providerFunction)

Add the pair to the set of active providers. The providers are used whenever looking in the kb for data. Can be used to implement overlayKb (unionKB? mergeKB?), and various otherwise-expensive tricks.
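A toy sketch of how such providers could be consulted during lookup. makeKB, match, and the pattern shape (null fields as wildcards) are all assumptions, not kgx's real API:

```javascript
// Toy provider mechanism: providers are consulted on every lookup, in
// addition to stored quads, so data can be produced on demand.
function makeKB () {
  const quads = []
  const providers = []
  const matches = (pat, q) =>
    ['subject', 'predicate', 'object'].every(k => pat[k] == null || pat[k] === q[k])
  return {
    add (q) { quads.push(q) },
    provide (pattern, fn) { providers.push(fn) },
    * match (pattern) {
      for (const q of quads) if (matches(pattern, q)) yield q
      // call every provider; a real implementation would index providers
      // by their registered pattern and skip non-overlapping ones
      for (const fn of providers) {
        for (const q of fn(pattern)) if (matches(pattern, q)) yield q
      }
    }
  }
}
```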

Unclear if we want:

  • provideBindings (Solutions), nice if the pattern has lots of constants in it and maybe some joins. Basically backward chaining. Could make answering some kinds of queries super efficient; you never actually need to turn things into quads.
  • provideQuads, simpler in the simple cases, especially like looking for all quads.
  • provideTriples, even simpler, and lets the system offer provenance pointing to this providerFunction.

Maybe that's settled by an options parameter.

Rules (planned)

Example File ruleset1.js

const ruleset = [
  {
    if: '?person foaf:firstName ?first; foaf:lastName ?last',
    do: v => { v.name = v.first + ' ' + v.last },
    then: '?person foaf:name ?name'
  }
]

ruleset.name = 'Name Vocabulary Conversion'
ruleset.strategy = 'Forward'

module.exports = ruleset

Use like:

kb.addRules(require('ruleset1'))

Variations:

  • if/then, all variables/bnodes match, pure datalog
  • if/then with fresh blank nodes in the then-clause, such as due to a [ ] or ( ) construct; this is now Horn logic (which is Turing complete). Not obvious how to implement backward-chaining with this without FOL-style "Terms". Maybe we make arrays (lists) native, and use them?
  • iff/then, implies the same rule with clauses swapped
  • if/do, just executable, forward only
  • if/do/then, a way to execute builtins to define vars for then (but best to make them side-effect free and using no data except what's in the argument, which is why they are in a separate module in the example)
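The if/do/then variation could be sketched as a single forward pass over pre-computed bindings. applyRule and the bindings shape are assumptions; in the real engine the if-clause would be matched against the kb to produce the bindings:

```javascript
// One forward pass of an if/do/then rule over pre-computed bindings.
// The do-builtin computes derived vars; the then-pattern plus bindings
// together describe the triples to conclude.
function applyRule (rule, bindingsList) {
  return bindingsList.map(v => {
    const bound = { ...v }
    if (rule.do) rule.do(bound)            // side-effect free builtin
    return { pattern: rule.then, bindings: bound }
  })
}

const rule = {
  if: '?person foaf:firstName ?first; foaf:lastName ?last',
  do: v => { v.name = v.first + ' ' + v.last },
  then: '?person foaf:name ?name'
}

const out = applyRule(rule, [{ person: '<tim>', first: 'Tim', last: 'Berners-Lee' }])
// out[0].bindings.name => 'Tim Berners-Lee'
```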

Provenance

Provenance chain can use graph label, at least when only triples are concluded. What happens when you want a rule about provenance, though? Terms/tuples seem much better for this than quads. Like, instead of graph literals, just use lists of triples, where triples are spo lists. But those are harder to search when we don't know the provenance.

Do .isolate, and the .provide and Crawler stuff, help with this? Do it just like that: always output isolated stuff, and link it with the provenance.

(If someone equates the graph labels, we'll get lost, though.)

_:gr007 { <a> <b> _:gr007_1 }
_:gr007_1 { <a> <b> <d> }
:fetch007
    providedDefaultGraph _:gr007;
    providedGraph _:gr007, _:gr007_1;
    completed $time;
    date $time;
    lastModified $time;
    fromRule $ruleID       # this is the only different part
    .

We're going to need views to be near-JS-level performance much of the time, via provide, I think. kb.provide inputs JS objects, kb.view outputs them, and we need to make sure there usually isn't a combinatorial explosion of joins in the middle.

owl:InverseFunctionalProperty

As a special case, this reasoning can be done like:

const kb2 = kgx.owlifp.rewrite(kb1, prefns)

It's equivalent to running the IFP rule and the equate rules, but

  • doesn't use the rules engine or anything sophisticated
  • doesn't chain; so it's only really appropriate for Datatype Properties, where chaining isn't needed
  • picks one of the values and discards the rest (you can give the preferred namespace to keep)
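A standalone sketch of that rewrite over [s, p, o] triples. The ifpRewrite name and signature are assumptions and kgx.owlifp.rewrite's real behavior may differ; the idea is just: subjects sharing a value for an inverse-functional property get merged, keeping the subject from the preferred namespace:

```javascript
// Sketch of a one-shot IFP rewrite (no rules engine, no chaining):
// pick a canonical subject per shared IFP value, rewrite every triple
// to use it, and drop the duplicates that result.
function ifpRewrite (triples, ifp, preferredNS) {
  const canonical = new Map()              // ifp value -> chosen subject
  for (const [s, p, o] of triples) {
    if (p !== ifp) continue
    const cur = canonical.get(o)
    if (cur === undefined ||
        (!cur.startsWith(preferredNS) && s.startsWith(preferredNS))) {
      canonical.set(o, s)
    }
  }
  const alias = new Map()                  // subject -> canonical subject
  for (const [s, p, o] of triples) {
    if (p === ifp) alias.set(s, canonical.get(o))
  }
  const rename = t => alias.get(t) || t
  const seen = new Set()
  const out = []
  for (const [s, p, o] of triples) {
    const t = [rename(s), p, rename(o)]
    const key = JSON.stringify(t)
    if (!seen.has(key)) { seen.add(key); out.push(t) }
  }
  return out
}
```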

Issue: should we use keys instead of IFP? I don't understand the DL-Safe issue in https://www.w3.org/TR/owl2-syntax/#Keys

See:

  • https://www.w3.org/TR/owl2-syntax/#Inverse-Functional_Object_Properties
  • https://www.w3.org/TR/owl2-mapping-to-rdf/

This is used for implementing movable schemas.