# Re-running partial batches

You may need to re-run a subset of the tests in a batch, for example because one experience hit a transient error or behaves non-deterministically. In these cases you can re-run only the problematic tests rather than the whole batch.

## Limitations

Tests are re-run as if they were being run for the first time now. No snapshot of the configuration from the original run is taken or reused. If you have updated any dependencies since then, such as the **system**, **image**, or **experience data**, the test runs with the latest definitions of that data.

When re-running tests, both the experience and metrics stages for that test will be re-run. The metrics stage will run against only the results from the new run; results from previous runs will be ignored.

It is not currently possible to re-run only the metrics stage for a test while reusing previously generated experience output.

The Batch Metrics stage will always be re-run. It will reuse the output from any test that did not need to be re-run, combining it with the fresh output from any re-run tests, ignoring old output for anything that was superseded. Batch Metrics may also be re-run in isolation, without needing to re-run any tests within the batch.

## CLI instructions

There are two ways to re-run batches using the CLI.

### Automated reruns using `batches supervise`

The `supervise` command under `resim batches` monitors a batch until it completes, then automatically re-runs the subset of failed tests, up to a set number of times.

#### Configuration

The `supervise` command takes several required and optional parameters. To see full documentation of each, please refer to `resim batches supervise --help`.

#### Required parameters

- `project`: the name or ID of the project to supervise
- `max-rerun-attempts`: maximum number of rerun attempts for failed tests (default: 1)
- `rerun-max-failure-percent`: maximum percentage of failed jobs before stopping (1-100, default: 50). This guards against wasting compute by re-running a faulty version; reruns are intended only for flaky tests. For example, if more than 50% of tests fail, the code under test is more likely to be broken than the failures are to be caused by non-determinism.
- `rerun-on-states`: the test result statuses that trigger an automatic rerun (e.g. Warning, Error, Blocker)
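
The `rerun-max-failure-percent` guard can be pictured with a little arithmetic. The following is a minimal sketch, not the CLI's actual implementation: `should_rerun` is a hypothetical helper, it assumes whole-number percentages, and it assumes a batch exactly at the threshold still qualifies for a rerun (the description above only says reruns stop when more than the threshold fails).

```shell
# Hypothetical helper illustrating the failure-percentage guard.
# Usage: should_rerun FAILED TOTAL MAX_PERCENT
# Succeeds (exit 0) when the share of failed tests is at or below MAX_PERCENT.
should_rerun() {
  local failed=$1 total=$2 max_percent=$3
  # An empty batch has nothing to rerun.
  [ "$total" -gt 0 ] || return 1
  # Compare failed/total <= max_percent/100 without floating point.
  [ $(( failed * 100 )) -le $(( max_percent * total )) ]
}

# 3 of 10 tests failed with a 50% limit: likely flakiness, so rerun.
if should_rerun 3 10 50; then echo "rerun"; else echo "stop"; fi
# 6 of 10 tests failed with a 50% limit: likely a faulty version, so stop.
if should_rerun 6 10 50; then echo "rerun"; else echo "stop"; fi
```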

#### Optional parameters

- `batch-id`: the ID of the batch to supervise
- `batch-name`: the name of the batch to supervise. One of `batch-name` or `batch-id` must be provided.
- `poll-every`: interval between batch status checks, expressed as a Go duration string (default: "30s")
- `wait-timeout`: amount of time to wait for a batch to finish, expressed as a Go duration string (default: "1h")

#### Exit codes

The supervise command returns an exit code corresponding to the final batch status:

- `0`: Success
- `1`: Internal error
- `2`: Error
- `5`: Cancelled
- `6`: Timed out

These exit codes can be used in CI pipelines to pass or fail a stage.
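
In a CI job, the exit code can be turned into a readable log line while still failing the stage. The sketch below is one possible wrapper, assuming a Bash-compatible shell; `describe_supervise_exit` and `run_supervised` are hypothetical helper names, and only the codes listed above are mapped.

```shell
# Hypothetical helper: map a supervise exit code to a readable label.
describe_supervise_exit() {
  case "$1" in
    0) echo "success" ;;
    1) echo "internal error" ;;
    2) echo "error" ;;
    5) echo "cancelled" ;;
    6) echo "timed out" ;;
    *) echo "unknown ($1)" ;;
  esac
}

# Hypothetical wrapper: run the given command, log the outcome to stderr,
# and return the command's original exit code so the CI stage passes or fails.
run_supervised() {
  local code=0
  "$@" || code=$?
  echo "supervise finished: $(describe_supervise_exit "$code")" >&2
  return "$code"
}

# Example invocation (not executed here):
#   run_supervised resim batches supervise \
#       --project "my-project" \
#       --batch-name "my-batch-name" \
#       --max-rerun-attempts 3 \
#       --rerun-max-failure-percent 30.0 \
#       --rerun-on-states "Error,Warning"
```

Because `run_supervised` returns the original exit code, the stage still passes only when the batch finishes with `0`.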

#### Example

```bash
resim batches supervise \
    --project "my-project" \
    --batch-name "my-batch-name" \
    --max-rerun-attempts 3 \
    --rerun-max-failure-percent 30.0 \
    --rerun-on-states "Error,Warning" \
    --poll-every "1m" \
    --wait-timeout "2h"
```

### Manual re-run using `batches rerun`

To manually re-run specific tests, for example to check for flaky failures, use the `rerun` command.

To re-run one or more tests:

```bash
resim batches rerun --batch-id={batch id} --test-ids={test id 1, test id 2, ...}
```

Specify the test IDs (also known as job IDs) from the batch in question. Do not use other identifiers, such as experience IDs.
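
When the failing test IDs come from a script, they need to be assembled into a single `--test-ids` value. The sketch below assumes the flag accepts a comma-separated list with no spaces, by analogy with the `--rerun-on-states` value in the `supervise` example; confirm this against the CLI's `--help` output. `join_ids` is a hypothetical helper, and the IDs are placeholders.

```shell
# Hypothetical helper: join its arguments with commas.
join_ids() {
  local IFS=,
  echo "$*"
}

# Placeholder values; substitute real IDs from your batch.
batch_id="BATCH_ID"
test_ids=$(join_ids "TEST_ID_1" "TEST_ID_2" "TEST_ID_3")

# Print the command instead of running it, so it can be reviewed first.
echo resim batches rerun --batch-id="$batch_id" --test-ids="$test_ids"
```

Once the printed command looks right, drop the leading `echo` to run it.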

To re-run only the Batch Metrics stage, omit the test IDs:

```bash
resim batches rerun --batch-id={batch id}
```
