insta-integration
v1.0.10
Published
insta-integration GitHub Action
Downloads
31
Maintainers
Readme
insta-integration - Integration Testing
Automated integration tests for any application/job.
- Spin up any external services
- Generate production-like data
- Run data validations to ensure application/job works as expected
Problems it can help with:
- Unreliable test environments
- Dependencies on other teams
- Simulate complex data flows
Usage
CLI
Install via
npm install -g insta-integration
Create YAML file
insta-integration.yaml
to define your integration testsRun
insta-integration
GitHub Action
Create YAML file
.github/workflows/integration-test.yaml
name: Integration Test on: push: branches: - * jobs: integration-test: name: Integration Test runs-on: ubuntu-latest steps: - name: Run integration tests uses: data-catering/insta-integration@v1
Create YAML file
insta-integration.yaml
to define your integration testsPush your code and the GitHub Action will run
Services
The following services are available to run alongside your application/job.
| Service Type | Service | Supported | | --------------------------- | ------------- | --------- | | Change Data Capture | debezium | ✅ | | Database | cassandra | ✅ | | Database | cockroachdb | ✅ | | Database | elasticsearch | ✅ | | Database | mariadb | ✅ | | Database | mongodb | ✅ | | Database | mssql | ✅ | | Database | mysql | ✅ | | Database | neo4j | ✅ | | Database | postgres | ✅ | | Database | spanner | ✅ | | Database | sqlite | ✅ | | Database | opensearch | ❌ | | Data Catalog | marquez | ✅ | | Data Catalog | unitycatalog | ✅ | | Data Catalog | amundsen | ❌ | | Data Catalog | datahub | ❌ | | Data Catalog | openmetadata | ❌ | | Distributed Coordination | zookeeper | ✅ | | Distributed Data Processing | flink | ✅ | | HTTP | httpbin | ✅ | | Identity Management | keycloak | ✅ | | Job Orchestrator | airflow | ✅ | | Job Orchestrator | dagster | ✅ | | Job Orchestrator | mage-ai | ✅ | | Job Orchestrator | prefect | ✅ | | Messaging | activemq | ✅ | | Messaging | kafka | ✅ | | Messaging | rabbitmq | ✅ | | Messaging | solace | ✅ | | Notebook | jupyter | ✅ | | Object Storage | minio | ✅ | | Query Engine | duckdb | ✅ | | Query Engine | flight-sql | ✅ | | Query Engine | presto | ✅ | | Query Engine | trino | ✅ | | Real-time OLAP | clickhouse | ✅ | | Real-time OLAP | doris | ✅ | | Real-time OLAP | druid | ✅ | | Real-time OLAP | pinot | ✅ | | Test Data Management | data-caterer | ✅ | | Workflow | temporal | ✅ |
Generation and Validation
Since it uses data-caterer behind the scenes to help with data generation and validation, check the following pages for discovering what options are available.
Data Sources
The following data sources are available to generate/validate data.
| Data Source Type | Data Source | Support | | ---------------- | ---------------------------------- | ------- | | Cloud Storage | AWS S3 | ✅ | | Cloud Storage | Azure Blob Storage | ✅ | | Cloud Storage | GCP Cloud Storage | ✅ | | Database | Cassandra | ✅ | | Database | MySQL | ✅ | | Database | Postgres | ✅ | | Database | Elasticsearch | ❌ | | Database | MongoDB | ❌ | | Database | Opensearch | ❌ | | File | CSV | ✅ | | File | Delta Lake | ✅ | | File | JSON | ✅ | | File | Iceberg | ✅ | | File | ORC | ✅ | | File | Parquet | ✅ | | File | Hudi | ❌ | | HTTP | REST API | ✅ | | Messaging | Kafka | ✅ | | Messaging | Solace | ✅ | | Messaging | ActiveMQ | ❌ | | Messaging | Pulsar | ❌ | | Messaging | RabbitMQ | ❌ | | Metadata | Data Contract CLI | ✅ | | Metadata | Great Expectations | ✅ | | Metadata | Marquez | ✅ | | Metadata | OpenAPI/Swagger | ✅ | | Metadata | OpenMetadata | ✅ | | Metadata | Open Data Contract Standard (ODCS) | ✅ | | Metadata | Amundsen | ❌ | | Metadata | Datahub | ❌ | | Metadata | Data Contract CLI | ❌ | | Metadata | Solace Event Portal | ❌ |
Examples
Simple Example
services: []
run:
- command: ./my-app/run-app.sh
test:
generation:
parquet:
- options:
path: /tmp/parquet/accounts
schema:
fields:
- name: account_id
validation:
parquet:
- options:
path: /tmp/parquet/accounts
validations:
- expr: ISNOTNULL(account_id)
- aggType: count
aggExpr: count == 1000
Full Example
services:
- name: postgres #define external services
data: my-data/sql #initial service setup (i.e. schema/tables, topics, queues)
run:
- command: ./my-app/run-postgres-extract-app.sh #how to run your application/job
env: #environment variables for your application/job
POSTGRES_URL: jdbc:postgresql://postgres:5432/docker
test:
env: #environment variables for data generation/validation
POSTGRES_URL: jdbc:postgresql://postgres:5432/docker
mount: #volume mount for data validation
- ${PWD}/example/my-app/shared/generated:/opt/app/shared/generated
relationship: #generate data with same values used across different data sources
postgres_balance.account_number: #ensure account_number in balance table exists when transaction created
- postgres_transaction.account_number
generation: #define data sources for data generation
postgres:
- name: postgres_transaction #give it a name to use in relationship definition
options: #configuration on specific data source
dbtable: account.transactions
count: #how many records to generate (1,000 by default)
perColumn: #generate 5 records per account_number
columnNames: [account_number]
count: 5
schema: #schema of the data source
fields:
- name: account_number
type: string
- name: create_time
type: timestamp
- name: transaction_id
type: string
- name: amount
type: double
- name: postgres_balance
options:
dbtable: account.balances
schema:
fields:
- name: account_number
options: #additional metadata for data generation
isUnique: true
regex: ACC[0-9]{10}
- name: create_time
type: timestamp
- name: account_status
type: string
options:
oneOf: [open, closed]
- name: balance
type: double
validation:
csv: #define data source for data validations
- options:
path: /opt/app/shared/generated/balances.csv
header: true
validations: #list of validation to run, can be basic SQL, aggregations, upstream data source or column name validations
- expr: ISNOTNULL(account_number)
- aggType: count
aggExpr: count == 1000
- options:
path: /opt/app/shared/generated/transactions.csv
header: true
validations:
- expr: ISNOTNULL(account_number)
- aggType: count
aggExpr: count == 5000
- groupByCols: [account_number]
aggType: count
aggExpr: count == 5
GitHub Action Options
Input
Optional configurations to alter the files and folders used by the GitHub Action can be found below.
| Name | Description | Default |
| -------------------- | -------------------------------------------------------------------------------------------- | ---------------------------------------- |
| configuration_file | File path to configuration file | insta-integration.yaml
|
| insta_infra_folder | Folder path to insta-infra (this repository) | ${HOME}/.insta-integration/insta-infra
|
| base_folder | Folder path to use for execution files | ${HOME}/.insta-integration
|
| data_caterer_version | Version of data-caterer Docker image | 0.12.1
|
To use these configurations, alter your
.github/workflows/integration-test.yaml
.
name: Integration Test
on:
push:
branches:
- *
jobs:
integration-test:
name: Integration Test
runs-on: ubuntu-latest
steps:
- name: Run integration tests
uses: data-catering/insta-integration@v1
with:
configuration_file: my/custom/folder/insta-integration.yaml
Output
If you want to use the output of the GitHub Action, the following attributes are available:
| Name | Description | | ----------------------- | ------------------------------------------------------------ | | num_records_generated | Total number of records generated. | | num_success_validations | Total number of successful validations. | | num_failed_validations | Total number of failed validations. | | num_validations | Total number of validations. | | validation_success_rate | Success rate of validations (i.e. 0.75 = 75% success rate). | | full_result | All result details as JSON (data generation and validation). |
For example, you can print out the results like below:
- name: Run integration tests
id: test-action
uses: data-catering/insta-integration@v1
- name: Print Output
id: output
run: |
echo "Records generated: ${{ steps.test-action.outputs.num_records_generated }}"
echo "Successful validations: ${{ steps.test-action.outputs.num_success_validations }}"
echo "Failed validations: ${{ steps.test-action.outputs.num_failed_validations }}"
echo "Number of validations: ${{ steps.test-action.outputs.num_validations }}"
echo "Validation success rate: ${{ steps.test-action.outputs.validation_success_rate }}"
JSON Schema for insta-integration.yaml
A JSON Schema has been created to
help guide users on what is possible in the insta-integration.yaml
. The links
below show how you can import the schema in your favourite IDE:
Validate JSON Schema
Using the following tool ajv
.
Validate the JSON Schema:
ajv compile --spec=draft2019 -s schema/insta-integration-config-latest.json
Validate a insta-integration.yaml
file:
ajv validate --spec=draft2019 -s schema/insta-integration-config-latest.json -d example/postgres-to-csv.yaml
Example Flows
Why Integration Test?
It is the closest you get to simulating production. This involves:
- Test your application/job end-to-end
- Connect to data sources
- Production-like data processed