Clay Data Analysis
Installation
- git clone
- nvm install v8
- npm install
- Authenticate to Google's Cloud API from an associated Google Cloud Platform Project and download the keyfile.json.
- Set the environment variable GOOGLE_APPLICATION_CREDENTIALS=[PATH], replacing [PATH] with the location of the keyfile.json file you downloaded in the previous step.
- Enable both the BigQuery API and the Google Natural Language API within your created project.
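For example, on macOS or Linux this can be done in your shell (the path shown is a placeholder, not a real location):

export GOOGLE_APPLICATION_CREDENTIALS=/path/to/keyfile.json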
Setup & Integration
In your app.js, instantiate Clay Data Science by passing in the parent directory where your tasks (data science features) will live:
const path = require('path');
const dataAnalysis = require('data-analysis'); // assumes the package name from the npm listing

dataAnalysis.config({
  projectDir: path.resolve('./parent-directory')
});
To leverage save and publish hooks, ensure that Clay Data Science is also passed in as an Amphora plugin during Amphora instantiation:
return amphora({
  plugins: [dataAnalysis]
});
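Putting the two snippets together, a minimal app.js might look like the following sketch. Amphora accepts other instantiation options that are omitted here, so treat this as an outline rather than a drop-in file:

const path = require('path');
const amphora = require('amphora');
const dataAnalysis = require('data-analysis');

// Point Clay Data Science at the parent directory that holds your tasks.
dataAnalysis.config({
  projectDir: path.resolve('./parent-directory')
});

// Pass the module in as a plugin so its save/publish hooks fire.
module.exports = amphora({
  plugins: [dataAnalysis]
});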
The parent directory should include a subdirectory called tasks, with each task including a handler, a transform, and a data schema. The directory structure should look like this:
- parent-directory
  - tasks
    - feature
      - handler.js
      - schema.yml
      - transform.js
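For orientation only, here is a sketch of what one of those files might contain. The real transform contract is not documented yet (see Transform below), so every name and shape here is a placeholder rather than the package's confirmed API:

// tasks/feature/transform.js — hypothetical placeholder, not the confirmed API
module.exports = function transform(doc) {
  // Flatten a Clay document into the row shape declared in schema.yml.
  return {
    uri: doc.uri,
    content: doc.content
  };
};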
Data Schema
Coming soon!
Transform
Coming soon!
Handler
Coming soon!
CLI
Clay Data Science also contains a handy CLI for importing legacy data from Elasticsearch into BigQuery. To get started, just set an ELASTICSEARCH_HOST environment variable.
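For example (the host URL is illustrative; point it at your own cluster):

ELASTICSEARCH_HOST=http://localhost:9200 ./bin/cli.js --help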
Commands
- npm run lint - runs eslint
- ./bin/cli.js --help
- nlp

NLP
Parses Elasticsearch data based on a specified NLP feature and stores the parsed data into a BigQuery dataset/table.
./bin/cli.js nlp --service elasticsearch --from published-articles.general --to clay_sites.content_classification --field content --query /path/to/query.json --schema /path/to/schema.yml --feature classifyContent
--service, -s <service>: The data source
--feature, -fe <feature>: An NLP feature, e.g. classifyContent
--from, -fr <index>.<type>: The Elasticsearch index and type to pull data from
--to, -t <dataset>.<table>: The BigQuery dataset and table to insert data into
--field, -f <field>: The data to analyze, based on property/field name
--query, -q <query>: The file path to a query to POST to Elasticsearch
--schema, -sc <schema>: The file path to a yml schema to pass to BigQuery

BigQuery Schemas
Coming soon!
Coming Soon
- Tests
- More NLP features!
- More thorough documentation on schemas within tasks