yakka
v0.0.0
Refinery
Refinery is a Python library and platform for building data pipelines that clean datasets and train ML models with human supervision and feedback.
It automatically provisions all required infrastructure and guarantees a least-privilege, privacy-compliant data architecture.
Features
- Train transformation functions (using AI) that are supervised by humans and continually improved with feedback and corrections.
- Orchestrate transformations with dependency graphs (DAGs)
- Compute data sets when new data arrives or when their dependencies change
- Re-compute data sets when a transformation function is changed or improves from learning
- Auto-provision all required cloud infrastructure
- Auto-configured to be compliant with privacy regulations such as HIPAA and GDPR
- Least-privilege IAM policies with auto-generated reports for regulators
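To make the "re-compute when dependencies change" idea concrete, here is a minimal sketch using only the Python standard library. It is not Refinery's API: the asset names, the `DEPENDENCIES` mapping, and the `stale_assets` helper are all hypothetical and exist only to illustrate how a change can propagate through a DAG.

```python
from graphlib import TopologicalSorter

# Hypothetical asset graph (not Refinery's API): each key depends on the assets in its set.
DEPENDENCIES = {
    "raw_videos": set(),
    "transcribed_videos": {"raw_videos"},
    "summaries": {"transcribed_videos"},
    "thumbnails": {"raw_videos"},
}


def stale_assets(changed, dependencies):
    """Return the assets to recompute, in dependency order, after `changed` is updated."""
    stale = set(changed)
    order = list(TopologicalSorter(dependencies).static_order())
    # Walk in topological order so staleness propagates downstream in a single pass.
    for asset in order:
        if dependencies.get(asset, set()) & stale:
            stale.add(asset)
    return [asset for asset in order if asset in stale]


print(stale_assets({"transcribed_videos"}, DEPENDENCIES))
```

In this sketch, changing `transcribed_videos` marks only that asset and its downstream `summaries` as stale, while `raw_videos` and `thumbnails` are left alone.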
Example
🔧 Note: Refinery is in active development. Not all features are implemented. Check back to see the following example grow.
Below is the simplest Refinery application: a Bucket with a Function that writes to it.
Your application's infrastructure is declared in code. The Refinery compiler analyzes it to auto-provision cloud resources (in this case, an AWS S3 bucket and a Lambda function) and to infer least-privilege IAM policies.
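```python
# Assumption: `asset` is importable from `refinery`; the original snippet used it without importing it.
from refinery import Bucket, asset, function

videos = Bucket("videos")  # provisioned as an AWS S3 bucket

@function()  # provisioned as an AWS Lambda function
async def upload_video():
    await videos.put("key", "value")

@asset()
async def transcribed_videos():
    ...
```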
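For a sense of what least-privilege inference could produce for the example above, here is a rough sketch of an IAM policy document that allows only `s3:PutObject` on the `videos` bucket, since `upload_video` only calls `put`. This is an illustration, not Refinery's actual output: the `put_only_policy` helper is hypothetical, and the physical bucket name Refinery provisions may differ from `videos`.

```python
import json


def put_only_policy(bucket_name):
    """Build an IAM policy document that allows only s3:PutObject on one bucket.

    Hypothetical helper for illustration; not part of Refinery.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                # Object-level actions apply to keys inside the bucket, hence the "/*".
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            }
        ],
    }


print(json.dumps(put_only_policy("videos"), indent=2))
```

The point of the sketch is that the function never reads, lists, or deletes objects, so a least-privilege policy would grant none of those actions.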
Research
Inspired by (and integrating with):
- [ ] https://dagster.io/
- [ ] https://www.llamaindex.ai/
- [ ] https://unstructured.io/
- [ ] https://docs.modular.com/mojo/roadmap.html
Naming Options
- Smelt is available on PyPI
- Refinery is not available on npm or PyPI
- I may have access to alchemy on npm, but it's taken on PyPI