Multi-Container Builds

Introduction

A truly single-node system is hard to find in the real world. Many robots are made up of multiple compute nodes working together, and even a robot with just a single node onboard will typically interact with an offboard system to some degree.

The ReSim platform supports running multiple containers together in a single experience, to better simulate your real-world systems. This is useful for representing a multi-node robot, multiple robots working together, or a robot communicating with an offboard cloud service.

Docker Compose

We support defining your experience's multiple containers via a docker-compose file. You can see an example that we use for testing here.

Our execution environment supports a subset of the configuration options of docker-compose. The expectation is that you can pass ReSim the same compose file that you would use for running your system locally on Docker, without any changes. However, not all options in the file will take effect in the ReSim platform. Most unsupported options will be ignored, but a small number that are strictly incompatible will surface as errors that will prevent your batch from running.

Volumes

We support the definition of named volumes in your docker-compose file. Each volume specified in this way will be created ephemerally for the duration of your running experience. It is available to mount as read/write into any or all services in your experience as per normal docker-compose behavior.

In addition to any named volumes, the standard /tmp/resim/ input/output directories will be automatically mounted into every container.
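As a minimal sketch, a named volume can be declared once and mounted read/write into multiple services (the service and volume names here are illustrative):

volumes:
  shared-data:

services:
  writer:
    image: my-writer:tag
    volumes:
      - shared-data:/data
  reader:
    image: my-reader:tag
    volumes:
      - shared-data:/data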

Networks

Custom networks are not supported. All services will be run in a single default network. Your experience will not expose any ports to the outside world. Each service that listens on a port must listen on a unique port, as containers may share the same host and port space, and thus conflict.
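For example, if two services each run a server, configure them to listen on distinct ports. How the listen port is set depends on your application; the environment variables below are purely illustrative:

services:
  service-a:
    image: my-image-a:tag
    environment:
      - LISTEN_PORT=8080
  service-b:
    image: my-image-b:tag
    environment:
      - LISTEN_PORT=8081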

Services

Each service is run as a separate container. Services may use the same image, or different images. See build image setup for more information on how to create a build image for your service and make it available to ReSim.

Startup dependencies

Unlike Docker, depends_on does not control startup ordering. If your containers have startup dependencies, they must fail healthchecks (or exit) and expect to be restarted continuously until their dependencies are running and they can start up correctly.
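As a sketch of this pattern, a dependent service can fail its healthcheck until its dependency is reachable (the service names and check command are illustrative, not a prescribed setup):

services:
  planner:
    image: my-planner:tag
    depends_on:
      - perception
    restart: on-failure
    healthcheck:
      test: ["CMD", "curl", "-f", "http://perception:8080/health"]
      interval: 5s
      retries: 10
  perception:
    image: my-perception:tag

Here the planner container is expected to fail (or fail its healthcheck) and be restarted until perception is up.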

See Container Lifecycle below for how depends_on works with container restarts.

Restart

See Container Lifecycle below for restart.

Resources

Service resource requests are supported. The resource requests of all containers must sum to less than the total resources requested for your system. If per-container resource requests are not specified, your containers will share the available resources without any resource guarantees - they may, for example, be OOM-killed.
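Per-service CPU and memory requests use the standard compose deploy.resources block; the values below are arbitrary examples:

services:
  my-sim:
    deploy:
      resources:
        reservations:
          cpus: "2.0"
          memory: 4G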

You can request GPUs for a given service.

services:
  my-sim:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
              count: 1

Services will not share GPUs; each service reserving a GPU will receive the requested number of dedicated GPUs.

Other

Other compose service blocks that should function as expected:

  • entrypoint / command
  • working_dir
  • healthcheck
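An illustrative sketch combining these blocks (image name, paths, and the check command are placeholders):

services:
  my-sim:
    image: my-image:tag
    working_dir: /app
    entrypoint: ["/app/run.sh"]
    command: ["--mode", "sim"]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 10s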

Environment Variables

Environment variables can be set in your compose file. They may also be set at the experience level; experience-level variables take precedence over those in your compose file, and are applied across all services.
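Setting variables in the compose file follows the standard compose syntax (the variable names here are illustrative):

services:
  my-sim:
    image: my-image:tag
    environment:
      - LOG_LEVEL=debug
      - SIM_MODE=replay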

Profiles

Profiles can be used for services. The specific profile to use can be set at the experience level.

resim experiences create [...] --profile {name}
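For example, a compose file might gate an optional service behind a profile (service and profile names are illustrative):

services:
  my-sim:
    image: my-image:tag
  debug-tools:
    image: my-debug-image:tag
    profiles: [debug]

With --profile debug set on the experience, the debug-tools service would be included; without it, only my-sim runs.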

Container Lifecycle

When testing a multi-node system, it's typical to have a number of nodes representing the various parts of the system, plus a single node that actually controls the running of the test, often referred to as the "test orchestrator". Nodes representing parts of a robot, for example, may not have things like "graceful shutdown" handling. So if we wanted to wait for all containers to exit to finish a test, that might not be possible.

Our system focuses on the "test orchestrator" for lifecycle management, such that you do not need to add complex startup or shutdown logic to other containers just for test purposes.

At least one container must have a label of resim.ai/isTestOrchestrator = true.
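In compose terms, this label is set on the orchestrator service (the service and image names here are illustrative):

services:
  test-orchestrator:
    image: my-orchestrator:tag
    labels:
      resim.ai/isTestOrchestrator: "true"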

When monitoring a test, a non-zero exit code in any orchestrator container will fail the test. When all orchestrator containers have exited with a zero exit code, the test is considered successful.

For non-orchestrator containers, their lifecycle is handled in two phases.

Initially, if a container depends_on another container, any failure of the dependent container will be ignored until its dependency is satisfied - that is, until the depended-on container is running.

After a container's dependencies (if any) are satisfied, any failure of that container will fail the test, unless it has restart defined, in which case it will restart without affecting the test. Orchestrator containers are never restarted.

Non-orchestrator containers may exit successfully at any time.

Summary of capabilities

|                                    | Orchestrator Containers                             | Non-Orchestrator Containers                         |
|------------------------------------|-----------------------------------------------------|-----------------------------------------------------|
| depends_on                         | Exits are ignored before dependencies are satisfied | Exits are ignored before dependencies are satisfied |
| exit code non-zero with restart    | Not possible                                        | Container restarts                                  |
| exit code non-zero without restart | Fails the test                                      | Fails the test                                      |
| exit code zero                     | Succeeds the test when all exit                     | Ignored, not restarted                              |

Termination on Failure

When the test enters a failed state due to an inappropriate container exit or timeout, any other running containers will be terminated.

Runtime Considerations

Container Logs

Logs from each container will be available in the ReSim web app, as experience-CONTAINERNAME-container.log

Container State

Resource metrics (CPU, memory) are automatically collected for each container. They can be viewed in the ReSim web app.

Cross-container Communication

Your containers will be able to communicate with each other over the default network, using the service name as the hostname. For example, if you have a service defined as my-service, you can reach port 8080 on that container by connecting to my-service:8080.
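As a sketch, a client service can reference the server by its service name in configuration (the service names and environment variable are illustrative):

services:
  my-service:
    image: my-server:tag
  client:
    image: my-client:tag
    environment:
      - SERVER_URL=http://my-service:8080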

Creating a Multi-Container Build

Once you have a compose file created, it is as simple as passing it to resim builds create in place of where you would use --image to pass in a single image URI:

resim builds create --branch=my-branch --system=my-system --version=1.0.0 --build-spec=my-docker-compose.yml

Once created, you can define and run test suites and batches as normal using this build ID. The ReSim web app will show the full compose file for each experience defined this way.

Direct API usage

If you use the API directly, the compose file is passed in as the buildSpecification field as part of the createBuildForSystem or createBuildForBranch inputs. It can be passed in as either YAML or JSON string content.
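As an illustrative sketch of the request body, showing only the buildSpecification field (any other required fields are omitted here), the compose content is embedded as string content:

{
  "buildSpecification": "services:\n  my-sim:\n    image: my-image:1.0.0\n"
}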

Helpful Tips

Image Tags

Every image in the compose file should be fully specified with the tag of the desired version to be pulled and run. Make sure the compose file specifies these tags; they are not parameterized later.

In some testing scenarios, you may not need all images in the environment to be freshly built for each test, and you might find it more efficient to use different tags across images, and only change some of the image tags on each test.

services:
  onboard-node-a:
    image: my-image-a:test-build-1-sha
  onboard-node-b:
    image: my-image-b:test-build-1-sha
  cloud-service-node:
    image: my-image-c:stable-v1