# Run Pipeline
Command-line interface for running docker containers in parallel.
Developed at Radiant Genomics for bioinformatics needs.
Features include:
- run locally or in the cloud (currently only Amazon cloud)
- parallelize across many files in a folder or a set of files
- configure memory usage and CPU usage
- store execution history in a searchable database
- store execution logs in a searchable database
- automatically monitor all execution logs, e.g. watch for a specific error message
- view stats on all executions, plus many other command-line options for controlling and monitoring executions
The framework is built on Docker and Kubernetes.
### A typical use case
- Create a docker image from a docker folder and a config file (described in later sections)
run-pipeline build -df <path to docker folder> -c <path to config file> --mode <aws or local>
- Start a run
run-pipeline start -c <path to config file> -n <name of run> --mode <aws or local>
- Monitor the runs
#Check general stats
run-pipeline stats -n <name of run> --mode <aws or local>
#Check logs (stdout and stderr)
run-pipeline log -n <name of run> --mode <aws or local>
#Look for keywords in the logs every 15 min
run-pipeline log -n <name of run> --grep "Exception|error" --watch 15m --mode <aws or local>
## Install
- Install using npm.
npm install -g run-pipeline
The installation process will ask whether you want to install all the prerequisites, such as MongoDB, the AWS CLI, and Kubernetes. Answer 'yes' only if you are on a Debian-based Linux operating system, because this process uses apt-get. If you are not on a Debian-based Linux operating system, see the prerequisites list below for what needs to be installed manually.
- Add the following variables to your environment (modify your .bashrc). DEFAULT_MODE is the default value for --mode (allowed values: local or aws). The AWS variables are not required if you are using local mode.
AWS_ACCESS_KEY_ID=**paste your aws access key id here**
AWS_SECRET_ACCESS_KEY=**paste your aws access key secret here**
AWS_DEFAULT_REGION=us-west-2
AWS_DEFAULT_AVAILABILITYZONE=us-west-2c
AWS_KEYPAIR=**your key pair**
AWS_ARN=**create an aws arn**
DEFAULT_MODE=local
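Since these are ordinary environment variables, one way to set them is to export them from your .bashrc. The snippet below is only a sketch with placeholder values, assuming a bash shell:

```bash
# ~/.bashrc -- placeholder values, replace with your own credentials
export DEFAULT_MODE=local
export AWS_ACCESS_KEY_ID=AKIAXXXXXXXXXXXXXXXX
export AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
export AWS_DEFAULT_REGION=us-west-2
export AWS_DEFAULT_AVAILABILITYZONE=us-west-2c
export AWS_KEYPAIR=my-keypair   # name of an existing EC2 key pair
export AWS_ARN=...              # your AWS ARN
```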
## Config file
The command-line interface relies on a configuration file, which specifies the amount of RAM and/or CPU, the inputs, and the outputs for a single execution of a pipeline. Below is a simple hello-world example. Other example configuration files can be found in the /examples folder. The files for creating the docker containers are located in a separate repository.
Local configuration
name: Hello World Run 3
author: Deepak
description: testing
pipeline:
  image: helloworld:latest
  command: /run.sh
  memory_requirement: 120M
volumes:
  shared:
    hostPath: /home/deepak/mydata
local_config:
  cpu_cores: 0-2
output:
  location: /home/deepak/mydata/output
inputs:
  location: /home/deepak/mydata/inputs
  file_names:
    -
      inputA: f1
      inputB: r1
    -
      inputA: f2
      inputB: r2
    -
      inputA: f3
      inputB: r3
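If this configuration is saved as, say, helloworld-local.yaml (the filename here is just an assumption; any name works), a local run can be started with the documented start command:

```bash
# start the pipeline locally using the config above (filename is a placeholder)
run-pipeline start -c helloworld-local.yaml
```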
Cloud configuration
name: Hello World Run 3
author: Deepak
description: testing
pipeline:
  repository: 570340283117.dkr.ecr.us-west-2.amazonaws.com
  image: helloworld:latest
  command: /run.sh
  public: no
  memory_requirement: 120M
cloud_config:
  workerCount: 2
  workerInstanceType: t2.nano
  workerRootVolumeSize: 100
local_config:
  cpu_cores: 0-2
output:
  location: s3://some/location
inputs:
  location: s3://some/location
  file_names:
    -
      inputA: f1
      inputB: r1
    -
      inputA: f2
      inputB: r2
    -
      inputA: f3
      inputB: r3
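Because the cloud configuration writes results to S3, one convenient way to verify that tasks produced output after a run is the standard AWS CLI; the bucket path below is just the placeholder from the example config:

```bash
# list everything written under the configured output location
aws s3 ls s3://some/location/ --recursive
```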
## Command-line interface
usage: runPipeline.js [-h] [-c path] [-n name] [-df path] [--grep regex]
[--cmd command] [-t name] [-q key-value]
[--watch interval] [-v] [--mode MODE] [--overwrite]
[--limit number]
option
Run dockerized pipelines locally or in Amazon Cloud
Positional arguments:
option Options: "init" to initialize a cluster, "build" to
build the docker container, "run" or "start" to start
tasks, "stop" to stop tasks, "kill" to destroy a
cluster, "log" to get current or previous logs,
"status" to get status of running tasks, "search" to
search for previous run configs
Optional arguments:
-h, --help Show this help message and exit.
-c path, --config path
Cluster configuration file (required for init)
-n name, --name name Name of run (if config file is also specified, then
uses this name instead of the one in config file)
-df path, --dockerfile path
Specify the dockerfile that needs to be built
--grep regex Perform grep operation on the logs. Use only with the
log option
--cmd command, --exec command
Execute a command inside all the docker containers.
Use only with the log option
-t name, --task name Specify one task, e.g. task-1, instead of all tasks.
Use only with log, status, run, or stop.
-q key-value, --query key-value
For use with the search program, e.g. name: "Hello"
--watch interval Check log every X seconds, minutes, or hours. Input
examples: 5s, 1m, 1h. Only use with run or log.
Useful for monitoring unattended runs.
-v, --verbose Print intermediate outputs to console
--mode MODE, -m MODE Where to create the cluster. Options: local, aws
--overwrite Overwrite existing outputs. Default behavior: does
not run tasks where output files exist.
--limit number Use with search or log to limit the number of outputs.
### Examples
Build a docker image for local usage:
run-pipeline build --config helloworld.yaml --dockerfile ./inputs/helloworld/Dockerfile
Build and upload a docker image for Cloud usage:
run-pipeline build --config helloworld.yaml --dockerfile ./inputs/helloworld/Dockerfile --mode aws
For local use, the start command will automatically initialize and run the pipeline:
run-pipeline start --config helloworld.yaml
Turn on automatic logging while running. In this case, watch the logs every 1 min:
run-pipeline start --config helloworld.yaml --watch 1m
Use --verbose to see the commands that are being executed by runPipeline.
For cloud usage, you need to initialize the cluster before starting the run:
run-pipeline init --config helloworld.yaml --mode aws
run-pipeline start --config helloworld.yaml -m aws
#or you can use the name of the run
run-pipeline start --name "Hello World" --mode aws
#shorthand
run-pipeline start -n "Hello World" -m aws
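For unattended cloud runs, these steps can be chained in a small shell wrapper. This is only a sketch built from the commands documented above; the config filename and dockerfile path are placeholders:

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: build, initialize, start, then keep watching the logs.
set -euo pipefail

CONFIG=helloworld.yaml   # placeholder config file
MODE=aws

run-pipeline build --config "$CONFIG" --dockerfile ./inputs/helloworld/Dockerfile --mode "$MODE"
run-pipeline init  --config "$CONFIG" --mode "$MODE"
run-pipeline start --config "$CONFIG" --mode "$MODE"

# Re-check the logs every 15 minutes for errors or exceptions.
run-pipeline log --config "$CONFIG" --grep "Exception|error" --watch 15m --mode "$MODE"
```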
A task is defined as a pipeline and a specific input (or set of input files). For example, in the above configuration example, there are three tasks: the first using f1 and r1, the second using f2 and r2, and the last using f3 and r3 input files.
Check the status of each task or a specific task
run-pipeline status -c helloworld.yaml -m aws
run-pipeline status -c helloworld.yaml -m aws -t task-1
(Screenshot: colored and indented console output.)
Obtain the logs from each task or a specific task
run-pipeline logs helloworld.yaml -m aws
run-pipeline logs helloworld.yaml --task "task-1" -m aws
grep (find) a specific phrase (regex) inside logs
run-pipeline logs helloworld.yaml --grep "grep Downloads -A 2"
Execute a command inside each task or a specific task
run-pipeline logs helloworld.yaml --cmd "cat /tmp/file1.txt"
run-pipeline logs helloworld.yaml --cmd "cat /tmp/file1.txt" -t task-1
Search old logs
run-pipeline logs --query "run: Hello World, createdAt: 2016-04-10"
Search old run configurations
run-pipeline search --query "name: Hello World, createdAt: 2016-06-10"
Restart a task
run-pipeline restart -c inputs/HS4000_plate3_bfcspadesmeta.yaml -t task-1
Restart tasks automatically when specific keywords are found in the logs or status. Note that regular expressions are allowed.
./auto-restart.js "HiSeq Plate 3 SPAdes" --logs '(not connect to the endpoint)|(different)' --status 'reason: Error|OOMKilled'
Check for specific key-value pairs in the status
run-pipeline status -c inputs/HS4000_plate3_bfcspadesmeta.yaml --query "status: exit"
run-pipeline status -c inputs/HS4000_plate3_bfcspadesmeta.yaml --query "status: running"
Keep checking logs every 60 seconds (and store in database) - useful for unattended runs
#monitor free memory every 60 seconds
run-pipeline logs helloworld.yaml --cmd "free -h" --watch 60s
Look for keywords in the logs and print 5 lines after each match
run-pipeline log -c inputs/HS4000_plate3_bfcspadesmeta.yaml --grep "'different number' -A 5"
Run a configuration using its name (gets latest version in the database)
run-pipeline start -n "Hello World"
Useful commands for monitoring memory and disk usage inside each task
#monitor free memory
run-pipeline logs helloworld.yaml --cmd "free -h"
#monitor CPU and memory usage
run-pipeline logs helloworld.yaml --cmd "ps aux"
#monitor free disk space
run-pipeline logs helloworld.yaml --cmd "df -h"
Divide executions into categories, i.e. namespaces. The executions will use the same cluster (when run in AWS).
run-pipeline start helloworld.yaml --namespace A
run-pipeline start helloworld.yaml --namespace B
#get logs from different pipelines in the same cluster
run-pipeline status helloworld.yaml --namespace A
run-pipeline logs helloworld.yaml --namespace B
When using --mode aws, you may use Kubernetes directly for features not provided by runPipeline.
cd Hello-World
kubectl --kubeconfig=kubeconfig get nodes
kubectl --kubeconfig=kubeconfig describe node ip-10-0-0-111.us-west-2.compute.internal
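Other standard kubectl subcommands work against the same kubeconfig; for example (the pod name below is illustrative):

```bash
# list the pods created for the run and inspect one of them
kubectl --kubeconfig=kubeconfig get pods
kubectl --kubeconfig=kubeconfig describe pod <pod-name>
kubectl --kubeconfig=kubeconfig logs <pod-name>
```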
## Creating a pipeline
A pipeline is a Docker image that uses environment variables to define input and output files. A pipeline can be executed multiple times, in parallel, using different inputs and/or outputs.
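As an illustration only (the exact environment variable names and mount paths the tool injects are assumptions here, based on the file_names keys and output location in the example configs above), the /run.sh entry point of the helloworld image could look roughly like this:

```bash
#!/bin/bash
# Hypothetical /run.sh for the helloworld pipeline.
# Assumes the inputs are exposed as environment variables named after the
# file_names keys (inputA, inputB) and that the output directory is mounted
# at /output -- both are assumptions, not documented behavior.
set -e

echo "Processing $inputA and $inputB"
cat "$inputA" "$inputB" > /output/combined.txt
```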
### Creating and running a Docker container
Building and executing docker containers is described in another repository.
### Cloud-ify Docker
In order to create a pipeline that is amenable to cloud computing, you need to upload the image to a docker repository, such as the AWS repository or Docker Hub. Below are the steps for uploading to the AWS docker repository.
- Login to AWS console
- Go to the EC2 Container Service link in your AWS console
- Create a new repository with the name of the pipeline, e.g. helloworld.
- Build the docker image using the appropriate tag. Obtain the tag by combining the repository URL and the image name. For the helloworld example, the command is as follows:
docker build -t 570340283117.dkr.ecr.us-west-2.amazonaws.com/helloworld:latest ./helloworld
- Push the newly created image to the repository. Again, replace the tag name with your docker image's tag.
docker push 570340283117.dkr.ecr.us-west-2.amazonaws.com/helloworld:latest
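Note that docker needs to be authenticated against the ECR registry before the push. With the classic (v1) AWS CLI this is typically done with something like the following; the exact command depends on your AWS CLI version:

```bash
# print a docker login command for your ECR registry and run it (AWS CLI v1 style)
$(aws ecr get-login --region us-west-2)
```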
## Issues to keep in mind
### When running locally
- Docker can consume hard disk space. Always clean up unused docker images. ./runPipeline kill will destroy all containers; if you do not use kill, you should also clean up stopped docker containers. Useful commands to put in your .bashrc file:
# Kill all running containers.
alias dockerkillall='docker kill $(docker ps -q)'
# Delete all stopped containers.
alias dockercleanc='printf "\n>>> Deleting stopped containers\n\n" && docker rm $(docker ps -a -q)'
# Delete all untagged images.
alias dockercleani='printf "\n>>> Deleting untagged images\n\n" && docker rmi $(docker images -q -f dangling=true)'
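On newer Docker releases (1.13 and later), a single built-in command covers most of this cleanup; whether it is available depends on your Docker version:

```bash
# remove stopped containers, dangling images, and unused networks in one step
docker system prune
```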
### When running in Amazon cloud
- AWS Container Service will place multiple containers on the same EC2 instance. This decision is based on the amount of memory and number of cores required by each container, so be sure to specify memory_requirement in the input file correctly. Otherwise, random processes may start failing because the EC2 instances become oversubscribed.
- The kill command may not correctly remove a cluster and its CloudFormation components due to dependencies. It is difficult to identify all the dependencies programmatically; in such cases, you have to go to the CloudFormation page in the AWS Console and delete those stacks manually. In my observation, the EC2 instances themselves are terminated properly by the kill command. AWS will not allow you to create more clusters once the maximum number of VPCs has been reached. VPCs should get deleted along with the clusters when kill is executed, but this is not always the case; you then need to delete them manually. Judging from the AWS discussion groups, this issue affects many people.
- AWS only allows a limited number of Virtual Private Clouds (VPCs), which are created by the ECS service for each cluster. Deleting VPCs through the command-line AWS interface is difficult because they have other dependencies; deleting them from the AWS Console is relatively easy, however.
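To see which VPCs currently exist in your region (and spot leftovers from killed clusters), the standard AWS CLI call is:

```bash
# list all VPCs in the configured region, including those created for ECS clusters
aws ec2 describe-vpcs
```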
## Prerequisites
- Docker (see instructions at https://docs.docker.com/engine/installation/). Add yourself to the docker group.
- Python 2.7 or greater.
- Use pip to install the following Python packages: pyaml, python-dateutil, awscli
- NPM and NodeJS version 6+
- Kubernetes 1.2.4. Download from https://github.com/kubernetes/kubernetes/releases/download/v1.2.4/
- MongoDB. Google will tell you how to install it.
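On systems where the installer's apt-get based setup does not apply, the Python-level dependencies can usually be installed with pip; the line below covers only that part (Docker, Kubernetes, NodeJS, and MongoDB still need their own platform-specific installs):

```bash
# install the Python packages required by run-pipeline
pip install pyaml python-dateutil awscli
```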
## License
MIT License
Copyright (c) 2016 Deepak Chandran
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.