Effortless automated testing at scale.
ReSim is a simulation-based evaluation platform for autonomous systems. Run thousands of virtual test scenarios in parallel, on every code change, with full metrics and CI integration.
- Tutorials: New to ReSim? Start here for a guided, end-to-end walkthrough.
- How-to Guides: Step-by-step instructions for setting up and operating ReSim.
- Reference: Precise specs for the ReSim data model, metrics library, and open-core APIs.
- Explanation: Understand the why behind ReSim's design and the practice of sim-based eval.
What is simulation-based evaluation?
Simulation-based evaluation (sim-based eval, or SBE) is the practice of testing a software system against a controlled set of virtual scenarios rather than deploying it in the real world. Instead of waiting for your robot or vehicle to encounter a situation on a test track, you recreate that situation digitally and run your system through it programmatically, at scale, whenever you want.
ReSim is built around this idea. Understanding it will help you get more out of the platform.
The core problem sim-based eval solves
Robots, autonomous vehicles, drones, and industrial machines share a testing challenge that pure software systems don't have. Their behavior depends on sensor data from a physical environment that's always changing. A motion planner that works in good lighting may fail at dusk. A perception system that handles dry roads may degrade in rain.
The obvious answer is simply to test in the real world. Go drive the car. Put the robot on the factory floor. But real-world testing runs into problems quickly:
It doesn't scale. A single test run takes hours or days of calendar time, plus physical resources and coordination. You might get through dozens of scenarios a week. You need thousands.
It's dangerous. Deploying software that hasn't been fully tested onto real hardware can damage equipment, endanger people, or cause mission-critical failures. There's no undo.
It's not reproducible. Two runs in "the same" conditions are never actually identical. Debugging a failure you can't reliably reproduce is painful.
It's too slow for CI. You can't block a pull request on a 4-hour physical test. Without fast feedback, changes pile up and confidence erodes.
Sim-based eval addresses all of this. Replace the physical environment with a virtual one and you can run thousands of scenarios in parallel, in minutes, safely, with deterministic results, triggered automatically on every code change.
What a simulation-based eval looks like
A sim-based eval has three ingredients:
Your system under test. The code you're evaluating (your perception stack, planner, controller, or full stack) packaged so it can run without real hardware. In ReSim, this is a Docker image called a Build.
A set of scenarios. The virtual situations you want to test against: sensor data, world state, initial conditions. These might come from a simulator (Isaac, Gazebo, a proprietary sim environment), from recorded field data replayed through your system, or from a combination of both. In ReSim, these are called Experiences.
A way to measure performance. Logic that takes your system's outputs (the decisions it made, the paths it took, what it detected) and produces scores and pass/fail verdicts. In ReSim, this is handled by the Metrics framework — or a Metrics Build if post-processing is required.
One run of your system against one scenario is a Test. Running across many scenarios is a Test Batch. The metrics across that batch tell you whether your system is performing well and whether it has improved or regressed since the last version.
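To make those terms concrete, here is a minimal, platform-agnostic sketch of what a test batch does conceptually. The names and structures are hypothetical, not the ReSim API; the point is the shape of the loop: one build, many experiences, one score per test.

```python
# Conceptual sketch of a test batch (hypothetical names, not the ReSim API).
# One Build (the system under test) runs against many Experiences, and a
# metric function scores each Test; the batch is just the collected results.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Experience:
    name: str            # e.g. "nighttime_merge_03"
    inputs: dict         # sensor data, world state, initial conditions

@dataclass
class TestResult:
    experience: str
    metrics: dict        # e.g. {"min_clearance_m": 1.4}
    passed: bool

def run_test(system_under_test: Callable[[dict], dict],
             experience: Experience,
             score: Callable[[dict], tuple[dict, bool]]) -> TestResult:
    """One Test: run the build against one Experience and score its outputs."""
    outputs = system_under_test(experience.inputs)
    metrics, passed = score(outputs)
    return TestResult(experience.name, metrics, passed)

def run_batch(system_under_test, experiences, score) -> list[TestResult]:
    """One Test Batch: the same build run across every Experience in the suite."""
    return [run_test(system_under_test, exp, score) for exp in experiences]
```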
Simulation vs. log replay vs. hardware-in-the-loop
"Sim-based eval" gets used to describe several related but distinct approaches, each with different tradeoffs.
Pure simulation feeds your system synthetic data generated by a simulator in real time. The simulator models the physical world (sensor noise, lighting, object dynamics) and your system interacts with it as though it were real. You can test scenarios that haven't happened yet and control conditions precisely. The tradeoff is the sim-to-real gap: if the simulator doesn't accurately model reality, your test results may not reflect real-world performance.
Log replay feeds your system recorded real-world data: sensor logs, lidar scans, camera streams. Your system runs against the exact inputs it would have received in the field. You can only test situations that have actually occurred, but there's no sim-to-real gap for the inputs. This is particularly useful for regression testing, since a scenario your system handled correctly before should still work after a code change.
Hardware-in-the-loop (HiL) connects your software to real physical hardware (actuators, sensors, embedded systems) while keeping the environment virtual or controlled. Use it when the software-hardware interface itself needs testing, not just high-level behavior. ReSim supports this via the ReSim Agent.
Most mature testing programs use all three at different stages. Log replay is cheapest and fastest; pure simulation gives the widest coverage; HiL gives the highest fidelity for hardware integration.
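To illustrate the log-replay approach described above, here is a small sketch that feeds recorded sensor frames through a system under test and compares its outputs against those saved from a known-good run. The file layout and names are hypothetical; a real harness would also handle timing, message ordering, and tolerant comparisons.

```python
# Sketch of log replay used for regression testing (hypothetical file layout
# and names). Recorded frames are fed to the system under test in order, and
# its outputs are compared against those saved from a known-good run.

import json
from pathlib import Path

def replay_log(log_dir: Path, system_under_test) -> list[dict]:
    """Feed each recorded frame to the system and collect its outputs."""
    outputs = []
    for frame_path in sorted(log_dir.glob("frame_*.json")):
        frame = json.loads(frame_path.read_text())   # one recorded sensor frame
        outputs.append(system_under_test(frame))
    return outputs

def check_regression(log_dir: Path, system_under_test, baseline_path: Path) -> bool:
    """True if the new outputs still match the outputs from the known-good run."""
    baseline = json.loads(baseline_path.read_text())
    return replay_log(log_dir, system_under_test) == baseline
```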
Why sim-based eval requires a platform
Running one scenario against one build is easy. The operational complexity shows up at scale.
At 10,000 experiences, 50 builds in flight, multiple systems, and hundreds of metric types, with a whole team trying to interpret results at once, things get complicated fast. Which build ran against which experiences? Which metrics regressed between v47 and v48? Did the failure on scenario 3,847 reproduce? What was the p95 latency across all nighttime scenarios?
ReSim handles that layer: test execution, results storage, metrics aggregation, batch comparisons, artifact management, and CI integration. The goal is to make sim-based eval practical at the scale and cadence a real engineering team needs.
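As a rough illustration of what that aggregation layer is doing, and assuming the per-test results are already in hand, a single question like "p95 latency across all nighttime scenarios" reduces to a query like this (illustrative only, not how ReSim implements it):

```python
# Illustrative aggregation over per-test results (not ReSim's implementation).
# Answers one of the questions above: p95 planning latency across all tests
# tagged "nighttime" in a batch.

import math

def p95_nighttime_latency(results: list[dict]) -> float:
    """results: [{"tags": [...], "latency_ms": float}, ...] for one batch."""
    latencies = sorted(r["latency_ms"] for r in results if "nighttime" in r["tags"])
    if not latencies:
        raise ValueError("no nighttime tests in this batch")
    # nearest-rank 95th percentile
    rank = max(0, math.ceil(0.95 * len(latencies)) - 1)
    return latencies[rank]
```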
Sim-based eval in the development lifecycle
Sim-based eval works best as a continuous feedback loop, not a checkpoint you run before a release.
A healthy sim-based eval practice typically looks something like this:
A developer opens a pull request, which triggers a batch run against a curated suite of scenarios. If key metrics regress or tests fail, the CI check fails and the developer finds out before the code merges. Nightly or on releases, a larger batch runs to catch subtler regressions. When a new failure mode turns up in the field, it gets captured as an experience and added to the suite so it can never quietly regress again.
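The gating step in that loop can be sketched as follows. The metric names, thresholds, and wiring are hypothetical and stand in for whatever your suite actually tracks; the essential move is that the check exits nonzero, so CI blocks the merge when a key metric regresses beyond its tolerance.

```python
# Sketch of a CI gate on batch metrics (hypothetical names and thresholds,
# not ReSim's CI integration). Exits nonzero so the CI check fails and the
# pull request is blocked when a key metric regresses.

import sys

# Higher is better for every metric listed here; each has an allowed drop.
TOLERANCES = {"scenario_pass_rate": 0.00, "min_ttc_s": 0.05}

def regressions(baseline: dict, candidate: dict) -> list[str]:
    """Names of metrics that dropped below baseline minus tolerance."""
    return [name for name, tol in TOLERANCES.items()
            if candidate[name] < baseline[name] - tol]

if __name__ == "__main__":
    baseline = {"scenario_pass_rate": 0.97, "min_ttc_s": 2.1}    # last accepted batch
    candidate = {"scenario_pass_rate": 0.95, "min_ttc_s": 2.2}   # this PR's batch
    failed = regressions(baseline, candidate)
    if failed:
        print("regressed metrics:", ", ".join(failed))
        sys.exit(1)                                              # CI check fails
    print("no regressions")
```

In practice the baseline typically comes from the most recent batch run on the main branch, so the comparison tracks the code the pull request would actually merge into.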
Over time, the experience suite becomes a precise characterization of what your system is supposed to do. New engineers can read the metrics and understand the system's requirements without digging through code. That's the real payoff: a shared, measurable, executable definition of expected behavior.
Ready to get started? Run your first test batch — takes about 20 minutes, no Docker required.