Re-Running Partial Batches

You might need to re-run a subset of the tests inside a batch. Maybe one experience encountered a transient error, or exhibits non-deterministic behaviour. Either way, you don't want to re-run all of the tests in the batch, just the one or few problematic ones.

Limitations

The tests will be re-run as if they were being freshly run now; they do not take a snapshot of the configuration from the original run and re-use that data. This means that if you have updated any of the dependencies (system, image, experience data, etc.), the test will run with the latest definitions of that data.

When re-running tests, both the experience and metrics stages for that test will be re-run. The metrics stage will run against only the results from the new run; results from previous runs will be ignored.

It is not currently possible to only re-run the metrics stage for a test re-using previously generated experience output.

The Batch Metrics stage will always be re-run. It will reuse the output from any test that did not need to be re-run, combining it with the fresh output from any re-run tests, ignoring old output for anything that was superseded. Batch Metrics may also be re-run in isolation, without needing to re-run any tests within the batch.

CLI Instructions

There are two ways to re-run batches using the CLI.

Automated Reruns using batches supervise

The supervise command under resim batches monitors a batch run until completion, then automatically re-runs the subset of failed tests, up to a set number of times.

Configuration

The supervise command takes several required and optional parameters. To see full documentation of each, please refer to resim batches supervise --help.

Required Parameters

  • project: the name or ID of the project to supervise
  • max-rerun-attempts: maximum number of rerun attempts for failed tests (default: 1)
  • rerun-max-failure-percent: maximum percentage of failed jobs before stopping (1-100, default: 50). This guards against wasting compute by re-running tests against a faulty version; reruns are intended only for test flakiness. For example, if more than 50% of tests fail, it is more likely that the code under test is broken than that the failures are due to non-determinism.
  • rerun-on-states: statuses of test results that trigger automatic reruns (e.g. Warning, Error, Blocker)

Optional Parameters

  • batch-id: the ID of the batch to supervise
  • batch-name: the name of the batch to supervise. Exactly one of batch-id or batch-name must be provided.
  • poll-every: interval between batch status checks, expressed as a Golang duration string (default: "30s")
  • wait-timeout: how long to wait for the batch to finish, expressed as a Golang duration string (default: "1h")
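
These durations follow Go's time.ParseDuration format: a number with a unit suffix such as "s", "m", or "h", which can be combined. For example (illustrative values):

    --poll-every "45s"       # check the batch status every 45 seconds
    --wait-timeout "1h30m"   # give up waiting after 90 minutes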

Exit Codes

The supervise command returns an exit code corresponding to the final batch status:

  • 0: Success
  • 1: Internal error
  • 2: Error
  • 5: Cancelled
  • 6: Timed out

These exit codes can be used in CI pipelines to pass or fail a stage.
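
For example, a minimal POSIX shell step (a sketch, reusing the project and batch names from the example below) that surfaces the outcome explicitly before propagating it to the CI stage:

resim batches supervise --project "my-project" --batch-name "my-batch-name"
status=$?
# report the final status, then propagate the supervise exit code
if [ "$status" -ne 0 ]; then
    echo "Batch did not succeed (supervise exit code: $status)"
fi
exit "$status"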

Example

resim batches supervise \
    --project "my-project" \
    --batch-name "my-batch-name" \
    --max-rerun-attempts 3 \
    --rerun-max-failure-percent 30.0 \
    --rerun-on-states "Error,Warning" \
    --poll-every "1m" \
    --wait-timeout "2h"

Manual re-run using batches rerun

If you need to manually re-run certain tests, for example to check whether a failure is flaky, you can use the rerun command.

To re-run one or more tests:

resim batches rerun --batch-id={batch id} --test-ids={test id 1, test id 2, ...}

Specify the test IDs (also known as Job IDs) from the batch in question. Do not use other identifiers, such as experience IDs.
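
For example, to re-run two specific tests (a sketch; the IDs below are hypothetical):

# batch and test IDs are hypothetical, for illustration only
resim batches rerun \
    --batch-id=8c1f4e2a-0b3d-4a6e-9c7f-5d2b1e0a8f4c \
    --test-ids=3a9d7c5e-1f2b-4d8a-b6c0-e4f9a2d1c3b5,7e5c3a1f-9b8d-4c2e-a0f6-1d4b7e9c2a5f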

To re-run just the Batch Metrics stage, omit the test IDs entirely:

resim batches rerun --batch-id={batch id}