Cost-efficient and scalable ML-experiments in AWS with spot-instances, Kubernetes and Horovod

At Rosebud AI we invent new tools for authoring and editing visual content. We combine established computer graphics techniques with cutting-edge AI-research.

Working with novel AI-techniques requires us to rapidly deploy and experiment. In this post, I will outline how we run our experiments at scale while balancing convenience and cost, and while making as few modifications to the code as possible.

Training DL-models at scale requires scalable computation, and cloud computing offers exactly that. At Rosebud AI we run most of our DL-experiments on AWS, but the advice below can be adapted to other cloud-providers.

In this article you will see how, with minimal changes to our ML-code, we:

  • provision spot compute-resources using AWS's EKS,
  • represent ML-experiments as Kubernetes jobs,
  • manage weights and experiment-summaries, and
  • perform single-machine distributed training on multiple GPUs.

Storage in the Cloud

Let's look at storage for a minute. Unfortunately, it's not trivial. AWS offers (among others) the following options:

  • S3: cheap, global and infinitely scalable object-based storage accessible via API. There is a fixed (and relatively big) penalty for each get/put operation. Conveniently, multiple clients can concurrently access and modify data.
  • EFS: expensive, global (and practically infinitely scalable) network file-share. Bandwidth is expensive and spotty, but multiple clients can mount the same file-share, without using any external libraries.
  • EBS: cheap, zonal (EBS volume belongs to a particular availability-zone within a region) single-instance equivalent of a regular hard-drive. EBS supports creating snapshots and restoring from them. Snapshots are stored in S3.
(Image credit: https://dzone.com/articles/using-amazon-efs-for-container-workloads)

For our experiments we use the following setup:

  • EFS for model-checkpoints and summaries. This provides several benefits: all our weights are always in the same location and available to every instance of our experiments, and we can run a single TensorBoard instance to show summaries from all experiments.
  • EBS for datasets. Before we run experiments, we create a separate EBS volume for each one (remember, an EBS volume can only be attached to a single instance at a time, so we need as many copies of the dataset as the number of parallel experiments we run). Usually, we instantiate EBS volumes from a master snapshot. A dataset-EBS is practically read-only: we only read our dataset from it. Once we are done with an experiment we can safely delete the corresponding EBS volume or repurpose it for the next experiment using the same dataset. Repurposing an EBS volume is a good idea: a fresh EBS-volume restored from a snapshot is slow, because every block of the volume is read from S3 on first access, while a repurposed EBS volume is "hot" and fast. Illustrative commands for this setup are shown right after this list.
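For illustration, preparing storage for a new experiment could look like the two commands below. The first restores a fresh dataset-volume from a master snapshot, the second points a single TensorBoard instance at the shared summaries-directory on EFS (the snapshot ID, zone and paths are placeholders, and the exact flags are an assumption rather than a prescription):

$ aws ec2 create-volume --availability-zone us-east-1a \
    --snapshot-id snap-0123456789abcdef0 --volume-type gp2
$ tensorboard --logdir /training-artifacts/experiments/summaries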

Spot Instances: Cheap Compute

To keep the cost of cloud-compute low, the best strategy is to use spot-instances (also known as preemptible instances in GCP or low-priority VMs in Azure). Spot-instances are regular VMs which the vendor sells at a steep discount (often a third or a quarter of the on-demand price) because they come from the vendor's spare compute capacity. When full-price demand for a certain instance-type reaches the vendor's total capacity, spot-instances are shut down with a short advance notice (2 minutes in AWS).

(Image credit: http://tech.nextroll.com/blog/dev/ops/2018/10/15/x-marks-the-spot.html)

This appears to be very inconvenient, but in practice it is very manageable with modern orchestration tools.

Managing Compute-Resources with EKS and Kubernetes

Every cloud-provider offers a range of possible hardware configurations for its VMs/compute-instances (on-demand or spot). For the purpose of DL-training we are interested in instance-types with GPUs. At Rosebud AI we mostly use p3.2xlarge and p3.16xlarge instances, which have 1x and 8x Nvidia V100 GPUs respectively.

To manage cloud-compute in AWS we use AWS's EKS service. It allows us to conveniently manage instances using declarative file-based manifests. EKS also has the benefit of turning all provisioned instances into nodes of a Kubernetes cluster (that's the "K" in EKS). We will get to the Kubernetes part shortly.

(Image credit: https://aws.amazon.com/blogs/compute/run-your-kubernetes-workloads-on-amazon-ec2-spot-instances-with-amazon-eks/)

One-time setup

In this section, I will describe the steps that only need to be performed once, so that we end up with a cluster ready to be scaled up and down for running experiments.

AWS's EKS has a convenient command-line tool, eksctl. It enables the following abstraction:

  • we define the desired state of our cluster (by describing it in a file and "applying" it to the cluster with eksctl),
  • and EKS then does whatever is necessary to keep the actual state of the cluster as close to the desired state as possible.

The EKS cluster itself (without any compute-instances in it) can be described as:

$ cat cluster.yaml

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: rosebud-experiments-cluster
  region: us-east-1
  version: "1.14"

vpc:
  subnets:
    public:
      us-east-1a: { id: SUBNET_IN_US_EAST_1A }

Here, the interesting parts are:

  • metadata.name - that's how the cluster will be named in the AWS console, and how we will refer to it when using eksctl
  • vpc.subnets.public - these are the AWS VPC subnets the cluster will attach compute-instances to. In the simplest scenario, we add all of our VPC subnets here. In this example, I only make one subnet available to the cluster.

We can create an EKS cluster using this manifest with the following command:

$ eksctl create cluster -f cluster.yaml

This command will use the default AWS credentials profile.

Once the EKS cluster is created, we want to make Kubernetes's management tool, kubectl, aware of it. This is done with:

$ aws eks --region us-east-1 update-kubeconfig --name rosebud-experiments-cluster

By now we have a minimal Kubernetes cluster without much interesting functionality in it. A few important Kubernetes add-ons must be "installed" into this new cluster.

First, let's make the cluster aware of GPUs attached to its compute-instances:

$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.11/nvidia-device-plugin.yml

Make the cluster's logs visible in AWS's CloudWatch:

$ curl https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/master/k8s-yaml-templates/quickstart/cwagent-fluentd-quickstart.yaml | sed "s/{{cluster_name}}/rosebud-experiments-cluster/;s/{{region_name}}/us-east-1/" | kubectl apply -f -

Make the cluster support AWS's EFS (network file-system):

$ kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/aws-efs-csi-driver/master/deploy/kubernetes/manifest.yaml

Make the cluster support AWS's EBS (cloud "hard drives"):

$ kubectl apply -k "github.com/kubernetes-sigs/aws-ebs-csi-driver/deploy/kubernetes/overlays/stable/?ref=master"

At this point our cluster still has no compute-resources. A desired scalable set of spot-instances is called a "node group" and is described as:

$ cat nodegroup.yaml

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: rosebud-experiments-cluster
  region: us-east-1

vpc:
  subnets:
    public:
      us-east-1a: { id: SUBNET_IN_US_EAST_1A }

nodeGroups:
  - name: rosebud-gpu-nodegroup-us-east-1a
    minSize: 0
    maxSize: 10
    availabilityZones: ["us-east-1a"]
    instancesDistribution:
      instanceTypes: ["p3.8xlarge", "p3.2xlarge"]
      onDemandBaseCapacity: 0
      onDemandPercentageAboveBaseCapacity: 0
    iam:
      withAddonPolicies:
        autoScaler: true
        cloudWatch: true
        ebs: true
        efs: true
    securityGroups:
      attachIDs: ['EFS_SECURITY_GROUP']

There are a few interesting pieces here:

  • nodeGroups[0].name is the name of the node-group
  • nodeGroups[0].minSize and nodeGroups[0].maxSize are the size limits for the node-group
  • nodeGroups[0].availabilityZones are the zones we will launch our instances in. For simplicity, we use a single zone here.
  • nodeGroups[0].instancesDistribution.instanceTypes describes the spot-instance-types we will be provisioning. The two types here have 4x and 1x V100 GPUs respectively.
  • nodeGroups[0].securityGroups.attachIDs[0] is the security group associated with the EFS where we store weights and summaries. This is necessary so that the EFS can be mounted from cluster-instances.

We can "apply" the definition of this node-group to our cluster with:

$ eksctl create nodegroup -f nodegroup.yaml

At this point, our cluster still has no compute-resources, but it knows what we might request from it.

A few more one-time setups remain. We need to make our cluster aware of the EFS-instance where we store weights and summaries. We describe it in:

$ cat efs.yaml

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs-training-artifacts
provisioner: efs.csi.aws.com
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-training-artifacts-pv
spec:
  capacity:
    storage: 500Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-training-artifacts
  csi:
    driver: efs.csi.aws.com
    volumeHandle: YOUR_EFS_ID
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-training-artifacts-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-training-artifacts
  resources:
    requests:
      storage: 500Gi

The only interesting thing here is:

  • spec.csi.volumeHandle which is the ID of our EFS-instance for weights and summaries

We apply EFS-configuration to our cluster with:

$ kubectl apply -f efs.yaml

We also need to make the cluster aware of the EBS storage class. It's described in:

$ cat ebs.yaml

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: ebs-sc
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Retain

And we "apply" it to the cluster with:

$ kubectl apply -f ebs.yaml


Let's review what we've done here:

  • we created an empty cluster,
  • configured it to be aware of GPUs once they become available as compute-resources,
  • defined which instances we will be provisioning for the cluster and how,
  • defined the storage we will be using in the cluster.

Defining Compute-Tasks

Much like we declared the cluster's compute-resources, we will use Kubernetes to declaratively describe the compute-tasks to be scheduled on those resources.

(Image credit: https://www.oreilly.com/library/view/programming-kubernetes/9781492047094/ch01.html)

In a nutshell, Kubernetes is a system that compares the desired state of computational-tasks with their actual state and attempts to bring the two to parity. Kubernetes is aware of what computational-resources are available in the cluster and takes that into account when scheduling tasks. For example, if we describe a task as requiring 8 GPUs and a particular EBS volume, Kubernetes will only attempt to start the task if the volume is free (not attached elsewhere) and there are 8 free GPUs on an instance to which the volume can be attached.

Any Kubernetes-task has a Docker image at its core. A Docker image bundles code with all of its dependencies; once built, it will predictably execute the command inside it wherever it is launched.

I will go into more detail about building a Docker image later, but for now let's assume that we have it built.

An ML experiment can be described as a Kubernetes "Job":

$ cat job.yaml

apiVersion: v1
kind: PersistentVolume
metadata:
  name: hungry-davinci-ffhq-d
  labels:
    job-name: hungry-davinci
spec:
  capacity:
    storage: 20Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  storageClassName: ebs-sc
  csi:
    driver: ebs.csi.aws.com
    volumeHandle: DATASET_EBS_VOLUME_ID
    fsType: ext4
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.ebs.csi.aws.com/zone
          operator: In
          values:
          - us-east-1d
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: hungry-davinci-ffhq-claim
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ebs-sc
  selector:
    matchLabels:
      job-name: hungry-davinci
  resources:
    requests:
      storage: 20Gi
---
apiVersion: batch/v1
kind: Job
metadata:
  name: hungry-davinci
  labels:
    job-name: hungry-davinci
spec:
  backoffLimit: 4
  template:
    metadata: 
      name: hungry-davinci-training
      labels:
        job-name: hungry-davinci
    spec: 
      restartPolicy: OnFailure  # Kubernetes Jobs only allow OnFailure or Never
      containers: 
        - name: hungry-davinci-training
          image: {{DOCKER_IMAGE_URL}}
          imagePullPolicy: Always
          command: ["horovodrun"]
          args:
          - "-np"
          - "1"
          - "-H"
          - "localhost:1"
          - "python"
          - "-u"
          - "main.py"
          - "--batch_size=4"
          - "--epochs=10000"
          - "--learning_rate=0.001"
          - "--model_dir=/training-artifacts/experiments/hungry-davinci"
          - "--summaries_dir=/training-artifacts/experiments/summaries/hungry-davinci"
          - "--samples=/input-dataset/in-the-wild-images"
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
          - name: training-artifacts
            mountPath: /training-artifacts
          - name: input-dataset
            mountPath: /input-dataset
          stdin: true 
          tty: true 
      volumes:
      - name: training-artifacts
        persistentVolumeClaim:
          claimName: efs-training-artifacts-claim
      - name: input-dataset
        persistentVolumeClaim:
          claimName: hungry-davinci-ffhq-claim

There are several interesting variable parts here:

  • spec.csi.volumeHandle references the ID of the EBS volume with our dataset
  • in spec.volumes we declare that our experiment will be using EFS with weights and summaries, as well as EBS with the dataset
  • in spec.containers[0].volumeMounts we declare that the weights & summaries and the dataset will be available to our code as the folders /training-artifacts and /input-dataset, respectively
  • in spec.containers[0].image we reference the Docker image, which contains experiment's code and all dependencies
  • in spec.containers[0].command and spec.containers[0].args we describe the command to run inside the instantiated Docker image. This is the command that starts the experiments.
  • kind and spec.backoffLimit state that we have a "Job"-type computational-task, which must be run until completion or be abandoned once it has failed more than backoffLimit (here, 4) times. Kubernetes has other kinds of tasks (for example, "Deployment") which are designed for constantly-running processes and aren't a good fit for finite ML-experiments.

At Rosebud AI we use a template-file for the Job-manifest and only have to replace the EBS-volume ID, the experiment's hyperparameters (passed as command-line arguments to the container), and the experiment's name. The rest of the Job-manifest remains unchanged. A minimal sketch of this templating step is shown below.
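As an illustration, the templating can be done with nothing more than Python's standard library (the job-template.yaml file and the ${EXPERIMENT_NAME}, ${EBS_VOLUME_ID} and ${TRAIN_ARGS} placeholders are hypothetical names; any standard Kubernetes templating tool works just as well):

from string import Template

# Read a Job-manifest template analogous to job.yaml above,
# with ${...} placeholders instead of concrete values.
with open("job-template.yaml") as f:
    template = Template(f.read())

manifest = template.substitute(
    EXPERIMENT_NAME="hungry-davinci",
    EBS_VOLUME_ID="vol-0123456789abcdef0",  # dataset copy for this run
    TRAIN_ARGS="--batch_size=4 --epochs=10000 --learning_rate=0.001",
)

with open("experiments/hungry-davinci.yaml", "w") as f:
    f.write(manifest)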

Now we have a description of the experiment and a fully-configured cluster with no resources in it. To run an experiment we need to execute two commands:

$ eksctl scale nodegroup --cluster=rosebud-experiments-cluster --name=rosebud-gpu-nodegroup-us-east-1a --nodes=1
$ kubectl apply -f job.yaml

The first command scales our empty cluster to one GPU-enabled spot-instance. The second schedules an experiment.

We could run only the second command, and Kubernetes would wait until resources are available before starting the experiment. In fact, that's exactly what happens as spot-instances disappear and become available again: Kubernetes will stop the experiment and resume it once matching compute is available again.

At this point we can create job-manifests for the whole set of experiments and schedule them all at once:

$ kubectl apply -f ./experiments/

Kubernetes will then keep track of the experiments' completion and schedule them on the cluster as resources become available.

At any moment we can list the active cluster-nodes and jobs:

$ kubectl get nodes
$ kubectl get jobs

After we are done with experimenting, we can scale the cluster back to zero instances with:

$ eksctl scale nodegroup --cluster=rosebud-experiments-cluster --name=rosebud-gpu-nodegroup-us-east-1a --nodes=0

Code Modifications

We only need to make a few minor modifications to the code to make it work in this setting:

  • We keep track of when we last checkpointed our model's weights, and checkpoint at least once every 10 minutes. If we lose the spot-instance and the experiment is interrupted, we lose at most 10 minutes of training-time.
  • We save checkpoints and summaries somewhere in the /training-artifacts folder, because that's the directory which persists in EFS and is made available to the code inside the Docker-container.
  • We expect the dataset to be located at /input-dataset.
  • We automatically restore training from the latest checkpoint (if there is one) when the training-experiment starts (see the sketch after this list).
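A minimal sketch of this checkpointing logic, assuming TensorFlow 2.x and that model, optimizer, dataset and train_step are defined elsewhere (the names are illustrative, not our exact code):

import time
import tensorflow as tf

MODEL_DIR = "/training-artifacts/experiments/hungry-davinci"  # persisted on EFS
CHECKPOINT_EVERY_S = 10 * 60  # checkpoint at least once every 10 minutes

ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
manager = tf.train.CheckpointManager(ckpt, MODEL_DIR, max_to_keep=3)

# Resume from the latest checkpoint if the experiment was interrupted.
if manager.latest_checkpoint:
    ckpt.restore(manager.latest_checkpoint)

last_saved = time.time()
for batch in dataset:       # the dataset is read from /input-dataset
    train_step(batch)       # forward/backward pass, defined elsewhere
    if time.time() - last_saved > CHECKPOINT_EVERY_S:
        manager.save()
        last_saved = time.time()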

Single-machine multi-GPU distributed training

When using multiple GPUs for DL-training, we use vanilla all-reduce distributed minibatch gradient descent:

  • each GPU/worker holds its own copy of all variables and independently performs forward and backward passes on a minibatch to calculate gradients (in map-reduce vocabulary, this is the "map" stage: we "map" a minibatch to gradients)
  • workers send the calculated gradients to all other workers (the "shuffle" stage)
  • once every worker has a copy of the gradients calculated by every other worker, it averages them and applies the average to its own copy of the weights (the "reduce" stage)
  • at this point, all workers again have identical values of all weights, because each worker applied the average of the same set of gradients (a toy illustration of this follows below)
  • after a worker completes its "reduce" stage, it processes the next minibatch to produce the next set of gradients.
(Image credit: https://eng.uber.com/horovod/)
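To make the synchronization argument concrete, here is a toy NumPy illustration (not Horovod itself): because every worker applies the same averaged gradient, the weight-copies stay identical across workers.

import numpy as np

# Gradients computed independently by 4 workers on different minibatches.
worker_grads = [np.array([0.2, -0.1]), np.array([0.4, 0.0]),
                np.array([0.1, -0.3]), np.array([0.3, 0.2])]

avg_grad = np.mean(worker_grads, axis=0)  # what all-reduce delivers to every worker

weights = np.array([1.0, 1.0])            # identical starting copy on each worker
lr = 0.01
updated = [weights - lr * avg_grad for _ in worker_grads]  # each worker's update

# All copies remain identical after the "reduce" stage.
assert all(np.allclose(updated[0], w) for w in updated)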

Currently, we rely on Horovod to enable distributed training with minimal code-modification.

Horovod supports inter-worker gradient-passing over the network between machines. However, multi-machine distributed DL-tasks require a very good understanding of, and fine control over, the networking aspect of the cloud to guarantee that network-bottlenecks don't negate the benefits of using multiple GPUs. We chose to avoid this complication and only use Horovod on multiple GPUs within a single instance, so that gradient-passing happens locally between processes.

If workers are slow at exchanging gradients in a distributed-training setting, then each worker will spend more time waiting for gradients ("shuffle" stage) and less time computing its own gradients, thus increasing GPU idle-time.

Using Horovod is trivial with Docker images. The Horovod-team provides a set of pre-built Docker images which bundle Horovod with the latest versions of popular machine-learning frameworks. The Horovod-project also has great examples of how existing code can be adapted for use in a distributed setting.

With Horovod, the only thing requiring attention is that one of the workers is always the master-worker (it has rank/ID equal to zero). A few parts of the training process should only be executed on the master-worker (if hvd.rank() == 0); a minimal sketch follows this list:

  • Saving and restoring checkpoints. After the master-worker has detected that there is a checkpoint to restore the model from, and has initialized its variables from it, it broadcasts the variable-values to all other workers. The same broadcast-step must also happen when training starts from scratch, to ensure that all workers have identical variable-initialization. The broadcast only happens once, before the regular gradient-exchange loop begins.
  • Writing summaries.
  • Producing artifacts for human-evaluation.
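A minimal Horovod-for-TensorFlow-2 sketch of these points, reusing model, optimizer, ckpt and manager from the checkpointing sketch above and assuming a compute_loss function (all names are illustrative, not our exact training code):

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each worker-process to its own GPU on the instance.
gpus = tf.config.experimental.list_physical_devices("GPU")
tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Only the master-worker restores from a checkpoint.
if hvd.rank() == 0 and manager.latest_checkpoint:
    ckpt.restore(manager.latest_checkpoint)

@tf.function
def train_step(images):
    with tf.GradientTape() as tape:
        loss = compute_loss(model(images, training=True))
    # Horovod averages the gradients across all workers (the all-reduce).
    tape = hvd.DistributedGradientTape(tape)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# Summaries (like checkpoints) are written by the master-worker only.
writer = (tf.summary.create_file_writer(
    "/training-artifacts/experiments/summaries/hungry-davinci")
    if hvd.rank() == 0 else None)

for step, batch in enumerate(dataset):
    loss = train_step(batch)
    if step == 0:
        # One-time broadcast: every worker adopts the master's variable-values
        # (restored from the checkpoint or freshly initialized).
        hvd.broadcast_variables(model.variables, root_rank=0)
        hvd.broadcast_variables(optimizer.variables(), root_rank=0)
    if hvd.rank() == 0 and step % 100 == 0:
        manager.save()
        with writer.as_default():
            tf.summary.scalar("loss", loss, step=step)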

Lastly, if randomization is used when splitting the dataset into training and test sets, the random-seed must be fixed. This ensures that the split is identical across experiment restarts and also across distributed workers.
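For example, a fixed-seed split might look like this (the seed value, dataset size, and 90/10 ratio are arbitrary illustrations):

import numpy as np

SPLIT_SEED = 42               # any fixed value, shared by every worker and restart
num_samples = 70_000          # e.g. the number of images in the dataset
indices = np.random.RandomState(SPLIT_SEED).permutation(num_samples)

num_test = num_samples // 10  # 90/10 train/test split
test_idx, train_idx = indices[:num_test], indices[num_test:]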

Final Thoughts

The strategy outlined in this article allows us to quickly design and run ML-experiments. There is still room for future improvements:

  • use standard Kubernetes templating tools to create Job-manifests;
  • standardize experiments-transfer from local infrastructure to the cloud;
  • integrate Kubernetes Job-manifests with hyperparameter-tuning libraries, so that experiments can be automatically created and scheduled on the cluster.

Would you like to learn more about what we are building and how we apply cutting-edge research to creative tooling? Shoot us an email at info@rosebudai.com
