Most agent demos get a very generous audience.

People forgive bad latency, weird tool use, fragile state, and unclear failure modes because the demo has a good story.

That is fine for a demo. It is a bad way to judge a system.

The standard I care about is simpler:

  • can I rerun it?
  • can I inspect it?
  • can I explain the failure?
  • can I trust the same workflow tomorrow?

If the answer is no, the demo is just a pretty temporary state.

This is also why I keep coming back to local setups and harnesses. They are less impressive on first glance and more useful after the third run.

That is the right tradeoff for me.