# Run Pipeline
Command-line interface for running docker containers in parallel.
Developed at Radiant Genomics for bioinformatics needs.
Features include:
- run locally or in the cloud (currently only Amazon cloud)
- parallelize across many files in a folder or a set of files
- configure memory usage and CPU usage
- store execution history in a searchable database
- store execution logs in a searchable database
- automatically monitor all execution logs, e.g. watch for a specific error message
- view stats on all executions, plus many other command-line options for controlling and monitoring executions
The framework is built on Docker and Kubernetes.
### A typical use case
- Create a docker image from a docker folder and a config file (described in later sections)
run-pipeline build -df <path to docker folder> -c <path to config file> --mode <aws or local>
- Start a run
run-pipeline start -c <path to config file> -n <name of run> --mode <aws or local>
- Monitor the runs
#Check general stats
run-pipeline stats -n <name of run> --mode <aws or local>
#Check logs (stdout and stderr)
run-pipeline log -n <name of run> --mode <aws or local>
#Look for keywords in the logs every 15 min
run-pipeline log -n <name of run> --grep "Exception|error" --watch 15m --mode <aws or local>
## Install
- Install using npm.
npm install -g run-pipeline
The installation process will ask whether you want to install all the prerequisites, such as MongoDB, the AWS CLI, and Kubernetes. Answer 'yes' only if you are on a Debian-based Linux operating system, because this process uses apt-get. If you are not on a Debian-based Linux operating system, see the prerequisites list below for what needs to be installed manually.
- Add the following variables to your environment (modify your .bashrc). DEFAULT_MODE is the default value for --mode (allowed values: local or aws). The AWS variables are not required if you are using local mode.
AWS_ACCESS_KEY_ID=**paste your aws access key id here**
AWS_SECRET_ACCESS_KEY=**paste your aws access key secret here**
AWS_DEFAULT_REGION=us-west-2
AWS_DEFAULT_AVAILABILITYZONE=us-west-2c
AWS_KEYPAIR=**your key pair**
AWS_ARN=**create an aws arn**
DEFAULT_MODE=local
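Since these are ordinary environment variables, one way to set them is to export them from your .bashrc. The snippet below is only a sketch with placeholder values, assuming a bash shell:

```bash
# ~/.bashrc -- placeholder values, replace with your own credentials
export DEFAULT_MODE=local
export AWS_ACCESS_KEY_ID=AKIAXXXXXXXXXXXXXXXX
export AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
export AWS_DEFAULT_REGION=us-west-2
export AWS_DEFAULT_AVAILABILITYZONE=us-west-2c
export AWS_KEYPAIR=my-keypair   # name of an existing EC2 key pair
export AWS_ARN=...              # your AWS ARN
```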
## Config file
The command-line interface relies on a configuration file, which specifies the amount of RAM and/or CPU, the inputs, and the outputs for a single execution of a pipeline. Below is a simple hello-world example. Other example configuration files can be found in the /examples folder. The files for creating the docker containers are located in a separate repository.
Local configuration
name: Hello World Run 3
author: Deepak
description: testing
pipeline:
  image: helloworld:latest
  command: /run.sh
  memory_requirement: 120M
volumes:
  shared:
    hostPath: /home/deepak/mydata
local_config:
  cpu_cores: 0-2
output:
  location: /home/deepak/mydata/output
inputs:
  location: /home/deepak/mydata/inputs
  file_names:
    -
      inputA: f1
      inputB: r1
    -
      inputA: f2
      inputB: r2
    -
      inputA: f3
      inputB: r3
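If this configuration is saved as, say, helloworld-local.yaml (the filename here is just an assumption; any name works), a local run can be started with the documented start command:

```bash
# start the pipeline locally using the config above (filename is a placeholder)
run-pipeline start -c helloworld-local.yaml
```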
Cloud configuration
name: Hello World Run 3
author: Deepak
description: testing
pipeline:
  repository: 570340283117.dkr.ecr.us-west-2.amazonaws.com
  image: helloworld:latest
  command: /run.sh
  public: no
  memory_requirement: 120M
cloud_config:
  workerCount: 2
  workerInstanceType: t2.nano
  workerRootVolumeSize: 100
local_config:
  cpu_cores: 0-2
output:
  location: s3://some/location
inputs:
  location: s3://some/location
  file_names:
    -
      inputA: f1
      inputB: r1
    -
      inputA: f2
      inputB: r2
    -
      inputA: f3
      inputB: r3
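Because the cloud configuration writes results to S3, one convenient way to verify that tasks produced output after a run is the standard AWS CLI; the bucket path below is just the placeholder from the example config:

```bash
# list everything written under the configured output location
aws s3 ls s3://some/location/ --recursive
```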
## Command-line interface
usage: runPipeline.js [-h] [-c path] [-n name] [-df path] [--grep regex]
[--cmd command] [-t name] [-q key-value]
[--watch interval] [-v] [--mode MODE] [--overwrite]
[--limit number]
option
Run dockerized pipelines locally or in Amazon Cloud
Positional arguments:
option Options: "init" to initialize a cluster, "build" to
build the docker container, "run" or "start" to start
tasks, "stop" to stop tasks, "kill" to destroy a
cluster, "log" to get current or previous logs,
"status" to get status of running tasks, "search" to
search for previous run configs
Optional arguments:
-h, --help Show this help message and exit.
-c path, --config path
Cluster configuration file (required for init)
-n name, --name name Name of run (if config file is also specified, then
uses this name instead of the one in config file)
-df path, --dockerfile path
Specify the dockerfile that needs to be built
--grep regex Perform grep operation on the logs. Use only with the
log option
--cmd command, --exec command
Execute a command inside all the docker containers.
Use only with the log option
-t name, --task name Specify one task, e.g. task-1, instead of all tasks.
Use only with log, status, run, or stop.
-q key-value, --query key-value
For use with the search program, e.g. name: "Hello"
--watch interval Check log every X seconds, minutes, or hours. Input
examples: 5s, 1m, 1h. Only use with run or log.
Useful for monitoring unattended runs.
-v, --verbose Print intermediate outputs to console
--mode MODE, -m MODE Where to create the cluster. Options: local, aws
--overwrite Overwrite existing outputs. Default behavior: does
not run tasks where output files exist.
--limit number Use with search or log to limit the number of outputs.
### Examples
Build a docker image for local usage:
run-pipeline build --config helloworld.yaml --dockerfile ./inputs/helloworld/Dockerfile
Build and upload a docker image for Cloud usage:
run-pipeline build --config helloworld.yaml --dockerfile ./inputs/helloworld/Dockerfile --mode aws
For local use, the start command will automatically initialize and run the pipeline:
run-pipeline start --config helloworld.yaml
Turn on automatic logging while running. In this case, watch the logs every 1 min:
run-pipeline start --config helloworld.yaml --watch 1m
Use --verbose to see the commands that are being executed by runPipeline.
For cloud usage, you need to initialize the cluster before starting the run:
run-pipeline init --config helloworld.yaml --mode aws
run-pipeline start --config helloworld.yaml -m aws
#or you can use the name of the run
run-pipeline start --name "Hello World" --mode aws
#shorthand
run-pipeline start -n "Hello World" -m aws
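For unattended cloud runs, these steps can be chained in a small shell wrapper. This is only a sketch built from the commands documented above; the config filename and dockerfile path are placeholders:

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: build, initialize, start, then keep watching the logs.
set -euo pipefail

CONFIG=helloworld.yaml   # placeholder config file
MODE=aws

run-pipeline build --config "$CONFIG" --dockerfile ./inputs/helloworld/Dockerfile --mode "$MODE"
run-pipeline init  --config "$CONFIG" --mode "$MODE"
run-pipeline start --config "$CONFIG" --mode "$MODE"

# Re-check the logs every 15 minutes for errors or exceptions.
run-pipeline log --config "$CONFIG" --grep "Exception|error" --watch 15m --mode "$MODE"
```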
A task is defined as a pipeline and a specific input (or set of input files). For example, in the above configuration example, there are three tasks: the first using f1 and r1, the second using f2 and r2, and the last using f3 and r3 input files.
Check the status of each task or a specific task
run-pipeline status -c helloworld.yaml -m aws
run-pipeline status -c helloworld.yaml -m aws -t task-1
(Screenshot: colored and indented console output.)
Obtain the logs from each task or a specific task
run-pipeline logs helloworld.yaml -m aws
run-pipeline logs helloworld.yaml --task "task-1" -m aws
grep (find) a specific phrase (regex) inside logs
run-pipeline logs helloworld.yaml --grep "grep Downloads -A 2"
Execute a command inside each task or a specific task
run-pipeline logs helloworld.yaml --cmd "cat /tmp/file1.txt"
run-pipeline logs helloworld.yaml --cmd "cat /tmp/file1.txt" -t task-1
Search old logs
run-pipeline logs --query "run: Hello World, createdAt: 2016-04-10"
Search old run configurations
run-pipeline search --query "name: Hello World, createdAt: 2016-06-10"
Restart a task
run-pipeline restart -c inputs/HS4000_plate3_bfcspadesmeta.yaml -t task-1
Restart tasks automatically when specific keywords are found in the logs or status. Note that regular expressions are allowed.
./auto-restart.js "HiSeq Plate 3 SPAdes" --logs '(not connect to the endpoint)|(different)' --status 'reason: Error|OOMKilled'
Check for specific key-value pairs in the status
run-pipeline status -c inputs/HS4000_plate3_bfcspadesmeta.yaml --query "status: exit"
run-pipeline status -c inputs/HS4000_plate3_bfcspadesmeta.yaml --query "status: running"
Keep checking logs every 60 seconds (and store in database) - useful for unattended runs
#monitor free memory every 60 seconds
run-pipeline logs helloworld.yaml --cmd "free -h" --watch 60s
Look for keywords in the logs and print 5 lines after each match
run-pipeline log -c inputs/HS4000_plate3_bfcspadesmeta.yaml --grep "'different number' -A 5"
Run a configuration using its name (gets latest version in the database)
run-pipeline start -n "Hello World"
Useful commands for monitoring memory and disk usage inside each task
#monitor free memory
run-pipeline logs helloworld.yaml --cmd "free -h"
#monitor CPU and memory usage
run-pipeline logs helloworld.yaml --cmd "ps aux"
#monitor free disk space
run-pipeline logs helloworld.yaml --cmd "df -h"
Divide executions into categories, i.e. namespaces. The executions will use the same cluster (when run in AWS).
run-pipeline start helloworld.yaml --namespace A
run-pipeline start helloworld.yaml --namespace B
#get logs from different pipelines in the same cluster
run-pipeline status helloworld.yaml --namespace A
run-pipeline logs helloworld.yaml --namespace B
When using --mode aws, you may use Kubernetes directly for features not provided by runPipeline.
cd Hello-World
kubectl --kubeconfig=kubeconfig get nodes
kubectl --kubeconfig=kubeconfig describe node ip-10-0-0-111.us-west-2.compute.internal
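Other standard kubectl subcommands work against the same kubeconfig; for example (the pod name below is illustrative):

```bash
# list the pods created for the run and inspect one of them
kubectl --kubeconfig=kubeconfig get pods
kubectl --kubeconfig=kubeconfig describe pod <pod-name>
kubectl --kubeconfig=kubeconfig logs <pod-name>
```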
## Creating a pipeline
A pipeline is a Docker image that uses environment variables to define input and output files. A pipeline can be executed multiple times, in parallel, using different inputs and/or outputs.
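As an illustration only (the exact environment variable names and mount paths the tool injects are assumptions here, based on the file_names keys and output location in the example configs above), the /run.sh entry point of the helloworld image could look roughly like this:

```bash
#!/bin/bash
# Hypothetical /run.sh for the helloworld pipeline.
# Assumes the inputs are exposed as environment variables named after the
# file_names keys (inputA, inputB) and that the output directory is mounted
# at /output -- both are assumptions, not documented behavior.
set -e

echo "Processing $inputA and $inputB"
cat "$inputA" "$inputB" > /output/combined.txt
```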
### Creating and running a Docker container
Building and executing docker containers is described in another repository.
### Cloud-ify Docker
In order to create a pipeline that is amenable to cloud computing, you need to upload the image to a docker repository, such as the AWS repository or Docker Hub. Below are the steps for uploading to the AWS docker repository.
- Login to AWS console
- Go to the EC2 Container Service link in your AWS console
- Create a new repository with the name of the pipeline, e.g. helloworld.
- Build the docker image using the appropriate tag. Obtain the tag by combining the repository URL and the image name. For the helloworld example, the command is as follows:
docker build -t 570340283117.dkr.ecr.us-west-2.amazonaws.com/helloworld:latest ./helloworld
- Push the newly created image to the repository. Again, replace the tag name with your docker image's tag.
docker push 570340283117.dkr.ecr.us-west-2.amazonaws.com/helloworld:latest
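Note that docker needs to be authenticated against the ECR registry before the push. With the classic (v1) AWS CLI this is typically done with something like the following; the exact command depends on your AWS CLI version:

```bash
# print a docker login command for your ECR registry and run it (AWS CLI v1 style)
$(aws ecr get-login --region us-west-2)
```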
## Issues to keep in mind
### When running locally
- Docker can consume hard disk space. Always clean up unused docker images. ./runPipeline kill will destroy all containers; if you do not use kill, you should also clean up stopped docker containers. Useful commands to put in your .bashrc file:
# Kill all running containers.
alias dockerkillall='docker kill $(docker ps -q)'
# Delete all stopped containers.
alias dockercleanc='printf "\n>>> Deleting stopped containers\n\n" && docker rm $(docker ps -a -q)'
# Delete all untagged images.
alias dockercleani='printf "\n>>> Deleting untagged images\n\n" && docker rmi $(docker images -q -f dangling=true)'
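On newer Docker releases (1.13 and later), a single built-in command covers most of this cleanup; whether it is available depends on your Docker version:

```bash
# remove stopped containers, dangling images, and unused networks in one step
docker system prune
```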
### When running in Amazon cloud
- AWS Container Service will place multiple containers on the same EC2 instance. This decision is based on the amount of memory and number of cores required by each container, so be sure to specify memory_requirement in the input file correctly. Otherwise, random processes may start failing because the EC2 instances become oversubscribed.
- The kill command may not correctly remove a cluster and its CloudFormation components due to dependencies. It is difficult to identify all the dependencies programmatically; in such cases, you have to go to the CloudFormation page in the AWS Console and delete those stacks manually. In my observation, the EC2 instances themselves are terminated properly by the kill command. AWS will not allow you to create more clusters once the maximum number of VPCs has been reached. VPCs should get deleted along with the clusters when kill is executed, but this is not always the case; you then need to delete them manually. Judging from the AWS discussion groups, this issue affects many people.
- AWS only allows a limited number of Virtual Private Clouds (VPCs), which are created by the ECS service for each cluster. Deleting VPCs through the command-line AWS interface is difficult because they have other dependencies; deleting them from the AWS Console is relatively easy, however.
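To see which VPCs currently exist in your region (and spot leftovers from killed clusters), the standard AWS CLI call is:

```bash
# list all VPCs in the configured region, including those created for ECS clusters
aws ec2 describe-vpcs
```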
## Prerequisites
- Docker (see instructions at https://docs.docker.com/engine/installation/). Add yourself to the docker group.
- Python 2.7 or greater.
- Use pip to install the following Python packages: pyaml, python-dateutil, awscli
- NPM and NodeJS version 6+
- Kubernetes 1.2.4. Download from https://github.com/kubernetes/kubernetes/releases/download/v1.2.4/
- MongoDB. Google will tell you how to install it.
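On systems where the installer's apt-get based setup does not apply, the Python-level dependencies can usually be installed with pip; the line below covers only that part (Docker, Kubernetes, NodeJS, and MongoDB still need their own platform-specific installs):

```bash
# install the Python packages required by run-pipeline
pip install pyaml python-dateutil awscli
```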
## License
MIT License
Copyright (c) 2016 Deepak Chandran
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.