datahub-client
v0.5.8
Published
APIs for interacting with DataHub
Downloads
57
Readme
Node client and utilities for interacting with https://DataHub.io and handling Data Packages.
Introduction
The DataHub platform stores a lot of different datasets - which are packages of useful data alongside with the description (here is the dataset specification). The data, stored on the DataHub, has a nice structure, views and a description that help people to get insights.
You can also store and share your own datasets on the DataHub
As a programmer, you may want to automate the process of getting or storing the data. Also you may want to integrate your project and the DataHub.
The datahub-client
library is designed for this. Let's explore it together.
Important notes:
- You need to use Node version > 7.6
- When you see the await keyword you should wrap this peace of the code in the async function.
Install
npm install datahub-client --save
const datahub = require('datahub-client')
Quick overview
With datahub-client
you can do things like:
- login to DataHub
- authenticate with the jwt token
- push a dataset:
- get the data from the DataHub:
- init a new dataset
- verify that existing dataset is correct
- transform tabular files info to different formats
Let's explore datahub-client
features more deeply below.
login and authenticate
Documentation is not ready at the moment. Information will be here after refactoring Login module.
Datahub class
Datahub class contains push()
and pushFlow()
methods, that is used to upload a dataset to the DataHub.io
push a dataset (dataset is an instance of a Dataset class: https://github.com/datahq/data.js#datasets):
const {DataHub} = require('datahub-client')
const {Dataset} = require('data.js')
/* secure jwt token and userId is taken from the Login&Auth module */
const datahubConfigs = {
apiUrl: 'http://api.datahub.io/',
token: 'jwt token',
debug: false,
ownerid: 'userId',
}
const datahub = new DataHub(datahubConfigs)
const pushOptions = {findability: 'unlisted'}
const res = await datahub.push(dataset, pushOptions)
console.log(res)
Possible push options:
- findability: one of 'unlisted', 'published', 'private'
- sheets: used to define excel sheets to push. Could be the sheet number, name or array of numbers, names
- schedule: 'every X[m|h|d|w]' (min, hours, days, weeks)
This is an example of correct datahub.push()
response:
{ dataset_id: 'username/finance-vix',
errors: [],
flow_id: 'username/finance-vix/40',
success: true }
If you get any errors - change the debug option: datahubConfigs = {debug: true, ...}
, to see the detailed log.
**pushFlow()
is an experimental method, its documentation is not ready yet.
Get data using the datahub-client
const datahub = require('datahub-client');
const {Dataset} = require('data.js');
const dataset = await Dataset.load(datasetUrl);
const resources = await datahub.get(dataset);
Dataset.load()
takes a path to the data package and returns a dataset object: https://github.com/datahq/data.js#datasets
datahub-client.get()
method accept the dataset object and returns an array with resources from it.
Each resource in the resources is the special File object from the data.js
lib: https://github.com/datahq/data.js#files
info
Info module contains two methods:
infoPackage(dpObj)
Shows the meta information about the dataset. @param {data.js/Datapackage
object} @return: {string}async infoResource(resource)
Shows the information about one particular resource @param {data.js/File
object} - only tabular file objects are supported @return: {string} - ascii table
const data = require('data.js');
const datahub = require('datahub-client');
let dataset = await data.Dataset.load('http://github.com/datasets/finance-vix')
console.log(
datahub.info.infoPackage(dataset),
await datahub.info.infoResource(dataset.resources[0])
)
init
Init module is used to interactively create a new datapackage or update an existing one.
init.init()
scan files/directories in the current directory- Asks user interactively about adding found files to the datapackage
- Generates/extends a
datapackage.json
file, ask user if it is correct - save
datapackage.json
on the disk.
Example: save this code into init.js
const datahub = require('datahub-client');
datahub.init.init()
Run the snippet in the terminal:
node init.js
This process initializes a new datapackage.json file.
Once there is a datapackage.json file, you can still run `data init` to
update/extend it.
Press ^C at any time to quit.
? Enter Data Package name - Some-Package-Name
? Enter Data Package title - Some-Package-Title
? Do you want to scan following directory ".idea" - y/n? n
? Do you want to scan following directory "basic-csv" - y/n? y
? Do you want to add following file as a resource "comma.csv" - y/n? y
comma.csv is just added to resources
? Going to write to /home/user/data/datapackage.json:
{
"name": "some-name",
"title": "some-title",
"resources": [
{
"path": "basic-csv/comma.csv",
<<<<<<<<< cut >>>>>>>>
Is that OK - y/n? y
datapackage.json file is saved in /home/user/data/datapackage.json
validate
This module contains Validator
class, which checks:
- the datapackage data is valid against the descriptor schema
- the descriptor itself is correct.
Using:
const datahub = require('datahub-client')
const validator = new datahub.Validator({identifier: path_to_descriptor})
validator.validate().then(console.log)
If the datapackage is valid - the validator will return True Otherwise it will return an object with the information:
- a TableSchemaError exception
- a list of errors that was found
- help info to find where the error is
{ TableSchemaError: There are 1 type and format mismatch errors (see error.errors') ...
_errors:
[ { TableSchemaError: The value "true" in column "boolean" is not type "date" and format "default" ... } ],
rowNumber: 2,
resource: 'comma',
path: '/home/user/work/basic-csv/comma.csv' }
cat
This module allows you to read the tabular data from inside the data.js/File
and to transform this data into a different formats.
Cat module has several writer functions, for different formats:
- ascii, csv, md, xlsx, html
Each of the writers function convert the given source file into the stream with appropriate format.
The module exports the 'writers' object, that contains all this functions together:
writers = {
ascii: dumpAscii,
csv: dumpCsv,
md: dumpMarkdown,
xlsx: dumpXlsx,
html: dumpHtml
}
Example of use:
const {writers} = require('datahub-client').cat
const data = require('data.js')
const resource = data.File.load('data.csv')
Promise.resolve().then(async ()=>{
const stream = await writers.ascii(resource)
stream.pipe(process.stdout)
// or you can save the stream into a file:
const writeStream = fs.createWriteStream('filename', {flags : 'w'})
stream.pipe(writeStream)
})
Output for writers.ascii
:
┌────────────────────────────────┬────────────────────────────────┬────────────────────────────────┐
│ number │ string │ boolean │
├────────────────────────────────┼────────────────────────────────┼────────────────────────────────┤
│ 1 │ one │ true │
├────────────────────────────────┼────────────────────────────────┼────────────────────────────────┤
│ 2 │ two │ false │
└────────────────────────────────┴────────────────────────────────┴────────────────────────────────┘
Output for writers.md
:
| number | string | boolean |
| ------ | ------ | ------- |
| 1 | one | true |
| 2 | two | false |
CSV:
number,string,boolean
1,one,true
2,two,false
HTML:
<table class="table table-striped table-bordered">
<thead>
<th>number</th>
<th>string</th>
<th>boolean</th>
</thead>
<tbody>
<tr>
<td>1</td>
<td>one</td>
....................
XLSX: excel file
For developers
You need to have Node.js version >7.6
Install
$ npm install
Running tests
We use Ava for our tests. For running tests use:
$ [sudo] npm test
To run tests in watch mode:
$ [sudo] npm run watch:test
Lint
We use XO for checking our code for JS standard/convention/style:
# When you run tests, it first runs lint:
$ npm test
# To run lint separately:
$ npm run lint # shows errors only
# Fixing erros automatically:
$ xo --fix