@rdfc/dumps-to-feed-processor-ts

v1.0.7

Published

a month ago

Does change detection on entities in consecutive data dumps and transforms these into an ActivityStreams change feed with the entities in a named graph


dumps-to-feed-processor-ts

A dumps-to-feed processor for the RDF Connect framework. It can be run as part of a pipeline using the js-runner, or as a standalone CLI tool.

This processor converts consecutive dumps of RDF data into a feed of changes. As input, it takes a dump of RDF data and a SHACL shape that describes the members of the feed. It runs the member extraction algorithm, using CBD (Concise Bounded Description) and the SHACL shape, to extract the members from the dump.

The extracted members are then compared to the members of the previous version of the dump to determine which are new, updated, or deleted. To compare them, the processor first normalizes each member using the RDF Dataset Canonicalization (RDFC-1.0) algorithm and then hashes the normalized member using the MD5 algorithm.

Each change is added to the feed as an activity: as:Create for a new member, as:Update for an updated member, and as:Delete for a deleted member. The ActivityStreams 2.0 ontology (https://www.w3.org/ns/activitystreams#) is used to describe the activities in the feed.
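The comparison step can be illustrated with a minimal sketch. It is not the processor's actual code: the names hashMember and detectChanges are hypothetical, the input is assumed to be already canonicalized N-Quads (the RDFC-1.0 step is not performed here), and plain in-memory Maps stand in for whatever state the processor keeps between runs.

```typescript
import { createHash } from "node:crypto";

type ActivityType = "as:Create" | "as:Update" | "as:Delete";

interface Activity {
  type: ActivityType;
  object: string; // IRI of the member the activity is about
}

// Hash a member's N-Quads with MD5.
// Assumption: the string was already canonicalized (e.g. with RDFC-1.0).
function hashMember(canonicalNQuads: string): string {
  return createHash("md5").update(canonicalNQuads, "utf8").digest("hex");
}

// Compare the members of the current dump with the hashes of the previous one.
// `previous` maps member IRI -> hash, `current` maps member IRI -> canonical N-Quads.
function detectChanges(
  previous: Map<string, string>,
  current: Map<string, string>,
): { activities: Activity[]; next: Map<string, string> } {
  const activities: Activity[] = [];
  const next = new Map<string, string>();

  for (const [iri, nquads] of current) {
    const hash = hashMember(nquads);
    next.set(iri, hash);
    const oldHash = previous.get(iri);
    if (oldHash === undefined) {
      activities.push({ type: "as:Create", object: iri });
    } else if (oldHash !== hash) {
      activities.push({ type: "as:Update", object: iri });
    }
    // Equal hashes: the member is unchanged and produces no activity.
  }

  // Members known from the previous dump but absent now were deleted.
  for (const iri of previous.keys()) {
    if (!current.has(iri)) {
      activities.push({ type: "as:Delete", object: iri });
    }
  }

  return { activities, next };
}

// Usage: the first run starts from an empty state, so every member is a Create.
const firstRun = detectChanges(
  new Map(),
  new Map([["https://example.org/dataset/1", "<s> <p> <o> .\n"]]),
);
console.log(firstRun.activities);
```

The returned `next` map is what a following run would pass in as `previous`. Starting from an empty map mirrors the effect described for the flush option.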

Under the hood, a file-based LevelDB database is used to store the members of the previous version of the dump. This database is used to compare the members of the new dump with the members of the previous dump.

How to run

Clone, install and build

git clone git@github.com:rdf-connect/dumps-to-feed-processor-ts.git
cd dumps-to-feed-processor-ts
npm install
npm run build

Install from npm

npm install @rdfc/dumps-to-feed-processor-ts

Run the CLI version

node . sweden https://admin.dataportal.se/all.rdf https://semiceu.github.io/LDES-DCAT-AP-feeds/shape.ttl\#ActivityShape -o feed.ttl

Run the example pipeline

An example pipeline configuration is provided in the example folder. You can run it with the following command:

npx js-runner example/pipeline.ttl

Configuration

The processor can be configured using the following parameters:

writer: A writer to write the output feed to.
feedname: The name of the feed. Used internally as the key under which the previous version of the feed is stored, so that the same processor can be used for multiple feeds.
flush: Whether to flush the previous version of the feed. If set to true, the processor will start with an empty feed and add all members from the dump as as:Create activities.
dump: A filename, URL, or serialized quads containing the dump of RDF data.
dumpContentType: The content type of the dump. Use identifier if dump is a filename or URL that has to be dereferenced.
focusNodesStrategy: One of extract, sparql, or iris. Use extract for automatic extraction (a SPARQL query finds and extracts all nodes of one of the DCAT-AP Feeds standalone entity types), sparql to provide your own SPARQL query, or iris to provide a comma-separated list of IRIs (NamedNode values).
nodeShapeIri: The IRI of the SHACL shape that describes the members of the feed.
nodeShape: The serialized SHACL shape in text/turtle format that describes the members of the feed. Optional.
focusNodes: Either a comma-separated list of IRIs of the NamedNodes that should be extracted as subjects, or a SPARQL query that resolves to a list of entities to be used as focus nodes. Which of the two is expected depends on the value of focusNodesStrategy. Optional.
dbDir: The directory where the LevelDB database will be stored. Default is "./".
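
As a rough illustration of how focusNodesStrategy and focusNodes relate, the sketch below types the string and boolean parameters listed above and resolves the focus nodes per strategy. The interface DumpsToFeedConfig and the function resolveFocusNodes are hypothetical names, not part of the package, and the writer parameter is left out because its type comes from the RDF Connect framework. The processor's real validation may differ.

```typescript
type FocusNodesStrategy = "extract" | "sparql" | "iris";

// Hypothetical shape mirroring the documented parameters (writer omitted).
interface DumpsToFeedConfig {
  feedname: string;
  flush: boolean;
  dump: string;
  dumpContentType: string;
  focusNodesStrategy: FocusNodesStrategy;
  nodeShapeIri: string;
  nodeShape?: string;
  focusNodes?: string;
  dbDir?: string; // documented default: "./"
}

type FocusNodes =
  | { strategy: "extract" }
  | { strategy: "sparql"; query: string }
  | { strategy: "iris"; iris: string[] };

// Interpret the focusNodes value according to the chosen strategy.
function resolveFocusNodes(
  strategy: FocusNodesStrategy,
  focusNodes?: string,
): FocusNodes {
  switch (strategy) {
    case "extract":
      // Automatic extraction: focusNodes is not needed.
      return { strategy: "extract" };
    case "sparql":
      if (focusNodes === undefined || focusNodes.trim() === "") {
        throw new Error("focusNodes must contain a SPARQL query");
      }
      return { strategy: "sparql", query: focusNodes };
    case "iris": {
      // Split the comma-separated list and drop empty entries.
      const iris = (focusNodes ?? "")
        .split(",")
        .map((iri) => iri.trim())
        .filter((iri) => iri.length > 0);
      if (iris.length === 0) {
        throw new Error("focusNodes must contain at least one IRI");
      }
      return { strategy: "iris", iris };
    }
  }
}

// Usage with placeholder values.
const exampleConfig: DumpsToFeedConfig = {
  feedname: "sweden",
  flush: false,
  dump: "https://admin.dataportal.se/all.rdf",
  dumpContentType: "identifier",
  focusNodesStrategy: "iris",
  nodeShapeIri: "https://semiceu.github.io/LDES-DCAT-AP-feeds/shape.ttl#ActivityShape",
  focusNodes: "https://example.org/dataset/1, https://example.org/dataset/2",
};
console.log(resolveFocusNodes(exampleConfig.focusNodesStrategy, exampleConfig.focusNodes));
```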

The SHACL definition of the processor can be found in processor.ttl.

Example

An example pipeline configuration is provided in the example folder: example/pipeline.ttl.

A full example of the processor in action for the Swedish DCAT-AP dump can be found here. This pipeline also contains the other processors set up to provide the dumps-to-feed-processor with the necessary data, and the processors to then write and publish the feed as a Linked Data Event Stream.
