CollaboFlow
'CollaboFlow' is a node.js environment for the coordinated execution of flows (or workflows) whose tasks can be implemented in node, python, and other languages. It is part of the [CollaboScience initiative]{@link http://www.collaboscience.com/}.
Introduction
Flow
'Flow' is a structured set of tasks that, taken together, serve a particular purpose. The purpose could be to parse a set of data in a particular way, to analyse the corpus of an author @see [Bukvik]{@link http://bukvik.litterra.info}, to crawl Wikipedia and analyse community behavior @see [Democracy-Framework - Wikipedia]{@link http://cha-os.github.io/democracy-framework-wikipedia/}, and so on.
A flow can be executed fully or partially. A user can run the entire flow, a particular task, or a particular task together with its dependencies, i.e. the specified task plus all the tasks it depends upon. For example, a task calculating some data may depend on the task that loads that data in the first place (its 'dependency').
@see:
- the [Flow implementation]{@link collabo-flow.Flow}
- the [FlowStructure]{@link collabo-flow.Flow.FlowStructure}
Task
'Task' describes a clearly isolated unit of work: it states 'what' has to be done and 'how' it has to be done. 'How' a task is executed is described by referring to a specific 'processor' that will perform the work. 'What' the task works with is described by a set of 'datasets' that the task uses for performing the intended work.
@see:
- the [Task implementation]{@link collabo-flow.Task}
- the [TaskStructure]{@link collabo-flow.Task.TaskStructure}
Processor
'Processor' is the code implementation of a particular, clearly defined type of work, invoked by a task. A processor usually takes the form of an internal or external module in the system. It can be written in different programming languages (JavaScript (nodejs), Python, ...). Each task relates 1-to-1 with the processor that performs it.
Users can
- use preexisting processors from some free/commercial bundles (packages)
- develop new processors themselves, or hire someone to develop them
- 'wrap' existing complex tools by creating light wrapper-processors that make those tools understandable to CollaboFlow (a sketch of such a wrapper follows below).
@see:
- the [Processor implementation]{@link collabo-flow.Processor}
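To illustrate the 'wrapping' idea, here is a minimal sketch of a wrapper-processor written as a nodejs module. The exported function signature (a task object plus a node-style callback), the 'toolArguments' parameter and the tool name are assumptions made for illustration only; the actual interface is defined by the [Processor implementation]{@link collabo-flow.Processor}.

// Hypothetical wrapper-processor; the exported signature below is an assumption,
// not the actual CollaboFlow processor interface.
var execFile = require('child_process').execFile;

module.exports = function wrapExternalTool(task, callback) {
    // 'toolArguments' is a made-up parameter standing in for the wrapped tool's arguments
    var args = (task.params && task.params.toolArguments) || [];
    execFile('some-external-tool', args, function (err, stdout, stderr) {
        if (err) { return callback(err); }
        // hand the tool's output back so it can be stored through the task's output port
        callback(null, stdout);
    });
};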
Dataset
A 'dataset' is a reference to any type (and cardinality) of data organized under a single logical/conceptual unit and identified with a unique namespace(:name). A dataset has various metadata associated with it and can be versioned, but overall it is a set of data that a task can uniquely access, load, update or store.
@see:
- the [Dataset implementation]{@link collabo-flow.Dataset}
- the [DatasetStructure]{@link collabo-flow.Dataset.DatasetStructure}
Port
A 'port' is a structural unit that connects tasks and datasets. Every task can have multiple named ports for accessing datasets. The two most common ports are 'in' and 'out': they contain the input datasets the task uses to retrieve all the data it needs for processing, and the output datasets the task uses for storing its results, respectively. Usually each port contains just one dataset, but in case a task processes an array of datasets, the port can contain an arbitrarily large array of datasets.
Each port has a semantically described direction, which can be one of the following:
- in: datasets will be used just for reading from
- out: datasets will be used just for writing to
- inout: datasets will be used both for reading from and writing to
@see:
- the [Port implementation]{@link collabo-flow.Port}
- the [PortStructure]{@link collabo-flow.Port.PortStructure}
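For illustration, this is how a single port entry could look inside a task description (it follows the same shape as the task example below; the port name 'snapshots' and its pairing with the 'inout' direction are made up for this sketch):

"snapshots": {
    "direction": "inout",
    "datasets": [
        {
            "namespace": "<NAMESPACE_DATA>.wiki-user.registered.talk-pages",
            "name": "snapshots"
        }
    ]
}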
Namespace and name
'Namespace' is a notation used for organizing both tasks and datasets. It consists of a set of words separated with periods, for example:
- df-wiki.get-pages-revisions.by-pages-list
- <NAMESPACE_DATA>.wiki-user.registered.contribution
- <NAMESPACE_DATA>.wiki-user.registered.talk-pages.snapshots
For the last two namespaces, we can see that they share the same beginning `<NAMESPACE_DATA>.wiki-user.registered`.
To fully specify (address) one entity, either a dataset or a task, we use both the namespace and the name of the entity in the form 'namespace:name'. For example, the task in the example below is addressed as '<NAMESPACE_TASK>.execute:titles-to-pageids'.
'Dependency' between tasks is mainly structured through the datasets they use. For example, if taskB consumes dataset1 (through its input port), and taskA produces dataset1 (through its output port), then we say that 'taskB depends on taskA'.
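A minimal sketch of such a dependency, showing only the relevant fields and using made-up task and dataset names: taskA writes dataset1 through its 'out' port, and taskB reads dataset1 through its 'in' port, so taskB depends on taskA.

{
    "type": "task",
    "name": "taskA",
    "ports": {
        "out": {
            "direction": "out",
            "datasets": [{ "namespace": "<NAMESPACE_DATA>", "name": "dataset1" }]
        }
    }
},
{
    "type": "task",
    "name": "taskB",
    "ports": {
        "in": {
            "direction": "in",
            "datasets": [{ "namespace": "<NAMESPACE_DATA>", "name": "dataset1" }]
        }
    }
}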
Examples
Task
{
    "type": "task",
    "namespace": "<NAMESPACE_TASK>.execute",
    "name": "titles-to-pageids",
    "comment": "Converts list of titles (+adding title prefix) into list with titles and pageids",
    "processorNsN": "df-wiki.get-pages.titles-to-pageids",
    "execution": 0,
    "forceExecute": true,
    "isPower": true,
    "params": {
        "outputFileFormats": ["json", "csv"],
        "processorSubType": "titles-to-pageids",
        "pagePrefix": "",
        "wereSpecialPages": false,
        "retrievingDepth": 0,
        "wpNamespacesToRetrieve": [0],
        "everyWhichToRetrieve": 1,
        "csvSeparator": ", ",
        "storeInDb": false,
        "transferExtraFields": ["popularity", "accuracy"]
    },
    "ports": {
        "in": {
            "direction": "in",
            "datasets": [
                {
                    "namespace": "<NAMESPACE_DATA>",
                    "name": "zeitgeist_2008-2013"
                }
            ]
        },
        "listOut": {
            "direction": "out",
            "datasets": [
                {
                    "namespace": "<NAMESPACE_DATA>",
                    "name": "zeitgeist_2008-2013-with-ids"
                }
            ]
        }
    }
}
Notably, there are a few important aspects:
- type: it tells that the entity in the flow is a task
- namespace: the namespace that the task belongs to
- name: the name of the task (inside of the namespace)
- comment: a free-form comment about the task (what it does, etc.)
- processorNsN: a full namespace:name path to the processor that will execute the task
- params: set of properties we are passing to the processor to tweak the task's execution
- ports: a set of ports of the task, where each port has an array of datasets that the task can access. In this case we have two ports, each with one dataset associated with it.
For more details please @see the [TaskStructure]{@link collabo-flow.Task.TaskStructure}
Ports and dataset
Creating a flow
In order to create a new flow, the user needs to create a new json file with the flow's description in it @see [FlowStructure]{@link collabo-flow.Flow.FlowStructure}.
Example of a blank flow:
{
    "name": "Micro-analysis",
    "authors": "Sinisha Rudan, Sasha Mile Rudan, Lazar Kovacevic",
    "version": "0.1",
    "description": "Micro-analyses of Wikipedia behavior",
    "environment": {
        "variables": {
            "NAMESPACE_DATA": "democracy-framework.wikipedia"
        }
    },
    "data": [
    ]
}
Note: Our sub-project [Bukvik]{@link http://bukvik.litterra.info} does support visualization and editing of flows, but we are still working on migrating it to a more general context necessary to visualize, edit, and run collabo-flows.
A flow is executed from the command line, for example:
node df-wiki/exec.js micro-analysis.json
After setting up all the metadata in the newly created flow, the user can add the first task @see the [TaskStructure]{@link collabo-flow.Task.TaskStructure}. For each task, the user needs to find an appropriate processor (@see [Processor implementation]{@link collabo-flow.Processor}), or create a new one. Some easily understandable processors can be found among the [core Collabo-Flow processors]{@link collabo-flow.processors}.
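As a sketch of what this could look like, assuming (this is an assumption, not something the documentation here confirms) that flow entities such as tasks are listed in the flow's 'data' array, the 'Micro-analysis' flow with its first task added would contain roughly the following; see the full task example above for the remaining fields, and the [FlowStructure]{@link collabo-flow.Flow.FlowStructure} for the exact placement:

"data": [
    {
        "type": "task",
        "namespace": "<NAMESPACE_TASK>.execute",
        "name": "titles-to-pageids",
        "processorNsN": "df-wiki.get-pages.titles-to-pageids",
        "params": {},
        "ports": {}
    }
]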
We also need to set up the root of the namespace; at the moment this can be done through 'collabo-flow.execute-flow.js' by setting the FFS constructor.
Creating a new processor
Create a new nodejs file, analyse-wp-polices.js, and store it with the other processors: df-wiki/processors/analyse-wp-polices.js.
Write stub content for the new processor
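A minimal stub, assuming the same hypothetical processor shape as in the wrapper sketch above (the real interface is defined by the [Processor implementation]{@link collabo-flow.Processor}):

// df-wiki/processors/analyse-wp-polices.js
// Stub processor; the exported signature is an assumption made for illustration.
module.exports = function analyseWpPolices(task, callback) {
    // TODO: read the datasets from the task's input port, analyse the WP policies,
    // and pass the results back so they can be stored through the output port
    callback(null, {});
};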
Creating a new sub-processor
Register it in 'collabo-flow.execute-flow:loadProcessors' :
// analyze-wp-polices
this.processors['df-wiki.processors.analyze-wp-polices.analyze-user-talk-pages'] = "../df-wiki/processors/analyze-wp-polices";
Possible Scenarios
Democracy Framework
Bukvik
multiple environments
- python,
- nodejs, and
- potentially more (R, java, ...)
- communicating through messaging
- communicating through streams
- communicating through caches
- communicating through filesystem storage
- communicating through databases
- communicating through services
- need to control tasks'/services' execution
- need to pipeline
- need to verify which parts of the dataflow have to be reprocessed/updated
LitTerra
ToDo
CollaboFlow
Release History
- 0.5.2 First separate package
- previous: existed as a part of democracy-framework-wikipedia