Replayable Benchmarks

Replayable failure is the product.

If a model only looks good once, I do not trust the result.

What I want from a benchmark is not a bigger screenshot. I want a harness that can answer a few boring questions:

That is the part people skip when they talk about agent performance.

The model matters. The score matters. But the workflow is the thing you actually ship.

The practical test is simple:

If you cannot do that, you do not have a benchmark yet.

You have a story.

I care about replayability because it makes failures cheaper to use. A failure I can reproduce on demand is one I can debug, bisect, and turn into a regression test.

The first failure tells you something broke. The second failure tells you whether the harness can show you the same thing again.

That is when the result starts becoming useful.
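
Here is a minimal sketch of that second-failure check, not a real harness. Everything in it is an assumption made for illustration: the names RunRecord, run_case, replay, toy_agent, and toy_check are hypothetical, and the "agent" is a stand-in function whose only randomness comes from a seeded generator. The idea is to save what a run needs to be repeated (case id, seed, input) alongside what it produced, then re-run from the saved record and compare.

```python
import hashlib
import json
import random
from dataclasses import asdict, dataclass
from typing import Callable

# Hypothetical signatures: an agent takes a prompt and a seeded RNG,
# a check takes the agent's output and says whether the case passed.
Agent = Callable[[str, random.Random], str]
Check = Callable[[str], bool]


@dataclass
class RunRecord:
    """Everything needed to repeat one run, plus what it produced."""
    case_id: str
    seed: int
    prompt: str
    output: str
    passed: bool

    def fingerprint(self) -> str:
        # Stable hash of the whole record, so two runs compare cheaply.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def run_case(agent: Agent, check: Check, case_id: str, seed: int, prompt: str) -> RunRecord:
    # All randomness flows through this one seeded generator.
    rng = random.Random(seed)
    output = agent(prompt, rng)
    return RunRecord(case_id, seed, prompt, output, check(output))


def replay(agent: Agent, check: Check, record: RunRecord) -> bool:
    """Re-run from the saved inputs; True if the run matches the record."""
    again = run_case(agent, check, record.case_id, record.seed, record.prompt)
    return again.fingerprint() == record.fingerprint()


def toy_agent(prompt: str, rng: random.Random) -> str:
    # Stand-in for a model call; its only nondeterminism is the seeded RNG.
    return f"{prompt}:{rng.randint(0, 9)}"


def toy_check(output: str) -> bool:
    # Arbitrary pass rule, only here so the record has a verdict.
    return output.endswith(":0")


if __name__ == "__main__":
    first = run_case(toy_agent, toy_check, "case-1", seed=7, prompt="sum")
    # Round-trip through JSON to mimic loading a saved run from disk.
    restored = RunRecord(**json.loads(json.dumps(asdict(first))))
    print(replay(toy_agent, toy_check, restored))
```

If replay comes back False, the harness is leaking state or randomness that the record did not capture, and any score it reported is a one-off. A real model call usually has more sources of drift than one seed, so the record would have to grow to cover them.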