Extending OpenTelemetry: A Proposed Repair Layer for Agentic AI

The OpenTelemetry GenAI conventions have built a remarkable foundation for AI observability. This post describes a proposal to extend that foundation one layer further into the remediation loop, turning detected failures into measurable reliability growth.

Last week I filed a proposal with the OpenTelemetry GenAI SIG (issue #3597) for a new semantic convention namespace covering something the community hasn't tackled yet: the remediation layer of agentic AI systems.

The GenAI SIG has done impressive work building out the observation layer with standardized telemetry for model calls, agent steps, tool execution, and evaluation results. That foundation makes what comes next possible. This proposal is an attempt to build on it: defining what happens after a failure is detected, how repairs are structured, and whether they actually worked.

The detection layer is genuinely good

The OpenTelemetry GenAI semantic conventions have matured rapidly. You can now trace model calls, agent steps, tool execution, and evaluation results with standardized, portable telemetry. Platforms like Arize and LangSmith have built sophisticated dashboards on top of these conventions.

But detection is only the beginning of the reliability story.

When an agentic system fails, whether it produces an incorrect output, misapplies a procedure, or hallucinates a document reference, the observability stack tells you that it happened. It might even tell you how often, and under what conditions. The natural next question is: what was done about it, and did it work?

There are currently no standard conventions for failure severity classification, corrective actions, or verification outcomes. No standard mechanism for linking repair attempts over time to show whether a system is getting more reliable. The remediation layer is the next frontier, and it is wide open.


Why this is harder for agentic systems than traditional software

In traditional software, bugs are discrete and reproducible. You can pull the failing test case, fix the code, run the test again. The feedback loop is tight and the evidence of improvement is unambiguous.

Agentic systems don't work that way. Their failures are emergent, arising from the interaction of models, users, context, and data in ways that weren't anticipated at build time. A failure that occurred on Tuesday with one user's document may not reproduce with a similar document on Wednesday, because the system state is different, the retrieval returned different context, or the model's stochastic output took a different path.

This is not a bug. It's a property of complex systems. And it means that the repair loop for agentic systems requires something traditional software doesn't: synthetic reconstruction of failure conditions.

You can't always reuse the original data. In many domains it's protected. You can't simply re-run the failing test because the conditions aren't stable enough. You have to recreate the failure in a controlled environment, using synthetic data that mirrors the structure and stressors of the original, and then verify that your repair holds up across a representative set of variants.

This is exactly what reliability engineering for complex systems has always done. Aerospace, medical devices, and military systems all face the same fundamental challenge: how do you improve a system that fails in ways you can't fully anticipate, using evidence you can't always directly reuse?

The answer, developed over decades, is a structured loop: detect, classify by severity, reproduce, repair, verify. Failure Mode and Effects Analysis. Reliability growth modeling. Repair packets with acceptance criteria. These aren't new ideas. They're established engineering practice, and they translate directly to agentic AI.
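That loop can be sketched in a few lines of Python. This is purely illustrative: the class names, fields, and function signatures are my assumptions, not anything defined in the proposal.

```python
from dataclasses import dataclass, field

@dataclass
class Failure:
    description: str
    severity: str        # classified by operational impact, e.g. "major"
    case_signature: str  # the pattern of conditions that produced it

@dataclass
class RepairCycle:
    failure: Failure
    action: str = ""
    synthetic_results: list[bool] = field(default_factory=list)
    verified: bool = False

def repair_loop(failure, corrective_action, run_variant, variants):
    """One detect -> classify -> reproduce -> repair -> verify cycle.

    `variants` are synthetic reconstructions of the failure conditions;
    `run_variant` returns True when the repaired system handles a variant.
    """
    cycle = RepairCycle(failure=failure, action=corrective_action)
    # Verify: the repair must hold across every representative variant.
    cycle.synthetic_results = [run_variant(v) for v in variants]
    cycle.verified = all(cycle.synthetic_results)
    return cycle
```

A cycle whose verification fails would feed the next repair attempt, which is exactly the chaining the proposal makes explicit.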


What gen_ai.repair.* proposes

The proposal defines a repair span, a structured telemetry record of a single repair cycle, with attributes that capture the remediation layer:

  • The severity of the failure, classified by operational impact
  • The case signature, meaning the pattern of conditions that produced it
  • The root cause hypothesis
  • Whether synthetic data was used to reproduce it
  • The corrective action taken
  • Whether the repair was verified, and how effectively
  • A link to the prior repair attempt on the same failure

That last attribute, gen_ai.repair.parent_repair_id, is the one I find most interesting. It creates a chain across repair iterations. The sequence of repair spans for a given failure is the reliability growth curve, expressed as telemetry. For the first time, you could ask your observability platform: is this system actually getting more reliable over time?
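As a sketch of what that question could look like against exported span data: the snippet below follows parent_repair_id links to rebuild the chain for one failure. Only gen_ai.repair.parent_repair_id comes from the proposal as described here; the other attribute names are hypothetical stand-ins.

```python
# Each dict stands in for the attributes of one exported repair span.
spans = [
    {"repair_id": "r1", "gen_ai.repair.parent_repair_id": None,
     "gen_ai.repair.verification_score": 0.62},
    {"repair_id": "r2", "gen_ai.repair.parent_repair_id": "r1",
     "gen_ai.repair.verification_score": 0.81},
    {"repair_id": "r3", "gen_ai.repair.parent_repair_id": "r2",
     "gen_ai.repair.verification_score": 0.95},
]

def growth_curve(spans):
    """Walk parent_repair_id links from the root repair to produce the
    reliability growth curve for a single failure. Assumes a simple
    chain: at most one follow-up repair per attempt."""
    by_parent = {s["gen_ai.repair.parent_repair_id"]: s for s in spans}
    curve, cursor = [], by_parent.get(None)  # root has no parent
    while cursor:
        curve.append(cursor["gen_ai.repair.verification_score"])
        cursor = by_parent.get(cursor["repair_id"])
    return curve

print(growth_curve(spans))  # [0.62, 0.81, 0.95]
```

A rising curve is the reliability growth the post describes, read directly out of telemetry.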


Why standards matter here

You might reasonably ask: why does this need to be a standard? Can't teams just implement their own repair tracking?

They can, and some do. But the value of a standard isn't that it enables something previously impossible. It's that it makes the thing portable, comparable, and auditable across tools and organizations.

If gen_ai.repair.* becomes an adopted convention, repair data becomes queryable in any OTel-compatible platform. Reliability metrics become comparable across deployments. And the synthetic bench used to verify a repair becomes a standardized artifact rather than a one-off local test, something with provenance that can be referenced, shared, and reused.
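Comparability is easy to picture concretely. With a shared attribute vocabulary, a metric like "fraction of repairs that verified" is the same query everywhere; the snippet below assumes hypothetical attribute and field names for illustration.

```python
# Hypothetical exported repair spans from two deployments.
spans = [
    {"deployment": "a", "gen_ai.repair.verified": True},
    {"deployment": "a", "gen_ai.repair.verified": False},
    {"deployment": "b", "gen_ai.repair.verified": True},
    {"deployment": "b", "gen_ai.repair.verified": True},
]

def verified_rate(spans, deployment):
    """Share of repair spans for a deployment that verified."""
    subset = [s for s in spans if s["deployment"] == deployment]
    return sum(s["gen_ai.repair.verified"] for s in subset) / len(subset)

print(verified_rate(spans, "a"))  # 0.5
print(verified_rate(spans, "b"))  # 1.0
```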

That changes the economics of repair considerably. Instead of each team rebuilding the scaffolding from scratch, the scaffolding becomes infrastructure.

What I'm looking for

The proposal is open and I'd love collaborators. If you work on AI observability, reliability engineering, or agentic systems in production, especially in domains where repair evidence needs to be auditable, I'd welcome your perspective, either in the GitHub issue or directly.

The community that built the detection layer has done something worth building on. I'm hoping this proposal is a useful next step in that conversation.

The full proposal is at open-telemetry/semantic-conventions #3597. The reliability engineering framework behind it is described in the Reins AI white paper: Reliability and Repair for Agentic Systems.
