Optimizing Build Images
Introduction
Running your tests at scale in ReSim means launching your build image on many cloud instances as quickly as possible.
This document explains how images are handled in ReSim and is intended to help customers optimize their images. Spending some time making your images smaller will save time and money, not only when running tests in ReSim but also when storing and distributing images in your own environments.
N.B.: This document refers to single images; if you are using our multi-container builds feature, each image is processed in the same way.
ReSim's Image Handling Architecture
When you register a build image with ReSim, our Mirror service pulls the image from your registry and stores a copy within ReSim's platform. Then, once a batch is launched with that build image, the Mirror service checks that it has an up-to-date copy of the image, and the ReSim platform launches cloud compute instances to run the image with the experiences configured in the batch.
This means that your image is transferred from one place to another at the following times:
- When a batch is launched, the Mirror service checks that it has an image in its storage matching the tag and digest. If not, it authenticates against your registry and pulls the image.
  - If the Mirror service determines that it needs to update its copy of the image, it transfers only the layers of the image that have changed. This is a significant potential optimization, discussed in Order layers below; you can inspect your image's layers with the command shown after the diagram.
  - If your image is hosted in a registry other than AWS ECR in us-east-1 (for example in AWS ECR in us-west-2, or in Google Artifact Registry), this transfer will incur egress costs for you in proportion to the size of the image, so reducing the size of your image will save time and money here. The same cost can arise in other situations, for example when you transfer images onto hardware in the field.
- When tests are running, the instances running the tests each pull the image from our Mirror service.
  - The number of instances we launch is carefully optimized: once you are running more than a small number of tests in a batch, we launch fewer instances and re-use them for tests in the same batch. This means the image is pulled once per instance, which may be fewer times than the number of tests. Because pulling and extracting images takes time, even from our Mirror service, which is closely co-located with the test instances, smaller images will reduce the time it takes for your batches to run.
architecture-beta
    group customer_cloud(cloud)[Customer Cloud]
    service customer_registry(disk)[Image Registry] in customer_cloud

    group resim(cloud)[ReSim]
    service mirror(disk)[Image Mirror] in resim
    service instance1(server)[Instance] in resim
    service instance2(server)[Instance] in resim
    service instance3(server)[Instance] in resim

    mirror:R -- L:instance1
    mirror:T -- L:instance2
    mirror:B -- L:instance3
    customer_registry:R -- L:mirror
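Because transfers are layer-aware, it helps to know what your image's layers actually contain. A quick way to check, assuming you have the docker CLI available locally (the image name here is illustrative):
# List each layer of the image, the instruction that created it, and its size
docker history my-registry.example.com/my-build:latest
Large layers that change on every build are the first candidates for the optimizations below.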
Potential Optimizations
There is a lot of discussion online about optimizing container images (we've provided some links to further reading below). Here we outline the changes that, in our experience, have the most impact. Some are trade-offs, where you exchange added complexity or maintenance overhead for smaller images that are cheaper and quicker to handle; others are outright improvements.
Use a lean base image
Generic base images can be very large, as they include packages to suit the potential needs of many users (text editors, SSH servers, compilers etc.). Consider starting with a minimal image and adding only the packages and dependencies you need.
Many popular base images are published with a slim variant, which is useful for this purpose.
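As an illustrative sketch (the base tag and file names here are assumptions, not a prescription), a Python test harness could start from the official slim variant instead of a general-purpose distribution image:
# Starting FROM ubuntu and installing python3, pip, and their dependencies
# typically produces a much larger image than a purpose-built slim base.
FROM python:3.12-slim
WORKDIR /app
# Install only the runtime dependencies the application declares
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py .
CMD ["python", "app.py"]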
Order layers
In your Dockerfile, commands that modify the filesystem (such as ADD, COPY and RUN) each create a new layer. Layers are cached in order: if a command near the top of the file produces a different layer on the next build, every layer below it in the Dockerfile is invalidated and rebuilt.
When using the image in ReSim, this means that our Mirror service will need to pull all of the newly-invalidated layers, and not just the most recently changed layer.
As a concrete example, suppose you had a Dockerfile like this:
FROM ubuntu
# let's record which commit this is (argument value provided by build command)
ARG GIT_SHA
RUN echo $GIT_SHA > /.build-version
# install dependencies and build our application
RUN apt-get update && apt-get install -y build-essential
COPY main.c Makefile /src/
WORKDIR /src/
RUN make build
This is a contrived example, but every time the image is built with a different GIT_SHA value, the RUN echo $GIT_SHA > /.build-version layer changes, and all layers after it (the apt-get install, COPY, and make build commands) are invalidated and rebuilt, even though the dependencies and source files may not have changed.
This means that running the build, uploading the image to long-term storage (docker push and similar commands are layer-aware), and transferring the image to ReSim all take longer than they need to.
If you really need that .build-version file, placing it at the end of the Dockerfile (or at least after the expensive operations) means that the RUN apt-get... and later commands are no longer invalidated by it.
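A minimal sketch of that fix, reordering the same Dockerfile so the frequently-changing layer comes last:
FROM ubuntu
# install dependencies and build our application
RUN apt-get update && apt-get install -y build-essential
COPY main.c Makefile /src/
WORKDIR /src/
RUN make build
# record which commit this is (argument value provided by build command);
# declared and used last, a new GIT_SHA now only rebuilds this final layer
ARG GIT_SHA
RUN echo $GIT_SHA > /.build-version
Now only the final layer differs between builds, so the Mirror service transfers a few bytes rather than the whole toolchain and build output.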
Don't install unnecessary packages
You should only install the packages you need to run your application. Development tools and other utilities make images less secure, slower to build, and larger. See Consider using multi-stage builds below for an example of how to build a smaller image that only contains your application and its runtime dependencies.
Install packages in one command
If you are installing multiple packages, install them in a single command. Splitting the installation into separate RUN commands leads to inefficient layer caching, affecting both build time and overall image size; in particular, if apt-get update runs in its own layer, Docker may reuse that cached layer later and install packages from a stale index.
For example:
RUN apt-get update && apt-get install -y build-essential \
python3 \
python3-pip \
libcudnn8 \
<...other required packages> \
&& rm -rf /var/lib/apt/lists/* # See below
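For contrast, here is the split form to avoid (the package list is illustrative); each RUN produces its own layer:
# Anti-pattern: each RUN creates a separate layer
RUN apt-get update
RUN apt-get install -y build-essential
RUN apt-get install -y python3 python3-pip
# If only an install line changes later, the cached apt-get update layer
# is reused, so packages may be installed from a stale index. Removing
# the cache in its own RUN also does not shrink the image: the files
# still exist in the layers above (see the next section).
RUN rm -rf /var/lib/apt/lists/*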
Clear or disable package manager caches
Package managers cache information about the packages available in configured repositories. This can take up a surprising amount of space.
For apt:
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
<...other required packages> \
&& rm -rf /var/lib/apt/lists/*
Note that if you split the installation and cache removal into separate RUN commands, the cache will appear to be removed in the final image, but the files will still be present in the earlier layers, taking up space.
For apk (Alpine's package manager):
RUN apk add --no-cache <packages>
This fetches up-to-date package information at install time without storing a copy of it in the image.
Consider using multi-stage builds
This is a more advanced technique, but where it suits your application, the best result for both image size and security is often an image that contains only your compiled binary.
For example:
FROM golang:1.25 AS builder
WORKDIR /src
COPY main.go .
# Disable cgo so the binary is statically linked and can run on the
# empty scratch base image below
RUN CGO_ENABLED=0 go build -o /bin/hello ./main.go

# The final image contains only the compiled binary
FROM scratch
COPY --from=builder /bin/hello /bin/hello
CMD ["/bin/hello"]
Conclusion
As you can see, it's possible to spend a small amount of time working on your Dockerfile and save significant time and money in return - not just when running in ReSim, but also when building, storing and distributing images for your own use. In addition, images with only the necessary packages installed present a smaller attack surface and therefore can be more secure.