Re-Running Partial Batches
You may need to re-run a subset of the tests inside a batch. Perhaps one experience encountered a transient error, or has non-deterministic behaviour. Rather than re-running every test in the batch, you can re-run just the one or few problematic ones.
Limitations
Re-run tests are executed as if they were being freshly run now. They do not take a snapshot of the configuration from the original run and re-use that data. This means that if you have updated any of the dependencies (system, image, experience data, etc.), the test will run with the latest definitions of that data.
When re-running tests, both the experience and metrics stages for that test will be re-run. The metrics stage will run against only the results from the new run; results from previous runs will be ignored.
It is not currently possible to re-run only the metrics stage for a test, re-using previously generated experience output.
The Batch Metrics stage will always be re-run. It will reuse the output from any test that did not need to be re-run, combining it with the fresh output from any re-run tests, ignoring old output for anything that was superseded. Batch Metrics may also be re-run in isolation, without needing to re-run any tests within the batch.
CLI Instructions
There are two ways to re-run batches using the CLI.
Automated Reruns using batches supervise
The supervise command under resim batches monitors a batch run until completion and then automatically re-runs the subset of failed tests, up to a configured number of attempts.
Configuration
The supervise command takes several required and optional parameters. To see full documentation of each, please refer to resim batches supervise --help.
Required Parameters
- project: the name or ID of the project to supervise
- max-rerun-attempts: maximum number of rerun attempts for failed tests (default: 1)
- rerun-max-failure-percent: maximum percentage of failed jobs before stopping (1-100, default: 50). This guards against wasting compute by re-running a faulty version; reruns are intended to address test flakiness only. For example, if more than 50% of tests fail, it is more likely that the code under test is broken than that the failures are due to non-determinism.
- rerun-on-states: statuses of test results that trigger automatic reruns (e.g. Warning, Error, Blocker)
Optional Parameters
- batch-id: the ID of the batch to supervise
- batch-name: the name of the batch to supervise. One of batch-name or batch-id is necessary.
- poll-every: interval between checks of the batch status, expressed as a Golang duration string (default: "30s")
- wait-timeout: amount of time to wait for the batch to finish, expressed as a Golang duration string (default: "1h")
Exit Codes
The supervise command returns an exit code corresponding to the final batch status:
- 0: Success
- 1: Internal error
- 2: Error
- 5: Cancelled
- 6: Timed out
These exit codes can be used in CI pipelines to pass or fail a stage, as in the sketch below.
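For instance, a minimal sketch of a CI step might look like the following. The project and batch names are placeholders, and this assumes the shell is not already exiting on error (e.g. via set -e), so the exit code can be inspected:

# Supervise the batch; capture the exit code so the CI stage can pass or fail on it.
resim batches supervise \
  --project "my-project" \
  --batch-name "my-batch-name" \
  --rerun-on-states "Error,Warning"
status=$?

# 0 means the batch (including any reruns) ultimately succeeded.
if [ "$status" -ne 0 ]; then
  echo "Batch did not succeed (supervise exit code: $status)" >&2
  exit "$status"
fi

Most CI systems will also fail the stage automatically on a non-zero exit code, so the explicit check above is only needed if you want to log or branch on the specific status.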
Example
resim batches supervise \
--project "my-project" \
--batch-name "my-batch-name" \
--max-rerun-attempts 3 \
--rerun-max-failure-percent 30.0 \
--rerun-on-states "Error,Warning" \
--poll-every "1m" \
--wait-timeout "2h"
Manual re-run using batches rerun
If you need to manually re-run certain tests, for example to check for flaky failures, you can use the rerun command.
To re-run one or more tests:
resim batches rerun --batch-id={batch id} --test-ids={test id 1, test id 2, ...}
Specify the test IDs (also known as job IDs) from the batch in question; do not pass other identifiers such as experience IDs.
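For example, a hypothetical invocation re-running two tests might look like the following (the batch and test IDs are placeholders, passed as a comma-separated list per the syntax above):

# Re-run two specific tests within an existing batch (IDs are placeholders).
resim batches rerun \
  --batch-id="6f1c2d3e-0000-4000-8000-000000000001" \
  --test-ids="9a8b7c6d-0000-4000-8000-000000000002,9a8b7c6d-0000-4000-8000-000000000003"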
To re-run just the Batch Metrics stage, simply specify no test IDs:
resim batches rerun --batch-id={batch id}