Self-Healing AI Systems: From Detection to Automatic Repair

Invited talk at RMAIIG presenting the reliability loop framework (from failure detection through severity classification, synthetic reproduction, and verified repair) for agentic AI systems in regulated industries.

Presented at RMAIIG's "When AI Systems Fail" event alongside Ken Fricklas and Beth Rudden, this talk challenges the assumption that detecting AI failures is the hard problem. Drawing from a production deployment in financial auditing, it walks through a real failure pattern: a self-planning audit agent that fabricated regulatory citations to support its work plan, a failure that slipped past human review but was caught by Reins AI's reference traceability monitor.

From there, the talk introduces the reliability loop (monitor, triage, simulate, repair, verify) not as a new invention, but as established reliability engineering practice (the same loop used in aviation, power grids, and space systems) now made practical for AI by the telemetry agentic systems naturally emit. Key topics include FMEA-based severity classification, the risk cube for prioritizing failures by probability and exposure, and why synthetic simulation is essential in regulated industries where reviewers cannot access production data directly.
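The triage step (prioritizing detected failures by probability and exposure, as in the risk cube) can be sketched roughly as follows. This is a hypothetical illustration, not Reins AI's implementation; the `Failure` class, its fields, and the `triage` function are all invented names for the sake of the example.

```python
from dataclasses import dataclass

@dataclass
class Failure:
    """A detected failure, scored for risk-cube triage (illustrative only)."""
    description: str
    probability: float  # estimated likelihood of recurrence, 0..1
    exposure: float     # regulatory/business impact if it recurs, 0..1

    @property
    def risk_score(self) -> float:
        # Risk-cube style prioritization: probability x exposure.
        return self.probability * self.exposure

def triage(failures: list[Failure]) -> list[Failure]:
    """Order failures by risk score so repair effort targets the worst first."""
    return sorted(failures, key=lambda f: f.risk_score, reverse=True)

if __name__ == "__main__":
    queue = triage([
        Failure("fabricated regulatory citation", probability=0.2, exposure=0.9),
        Failure("stale knowledge-base link", probability=0.6, exposure=0.1),
    ])
    for f in queue:
        print(f"{f.description}: risk={f.risk_score:.2f}")
```

Here the rare but high-exposure citation fabrication (0.2 × 0.9 = 0.18) outranks the frequent but low-impact stale link (0.6 × 0.1 = 0.06), which is the point of weighting by exposure rather than frequency alone.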

The talk closes with open research questions on evaluator drift, improvement conflicts in adaptation, and the push toward OpenTelemetry standards for the repair layer.

Watch the talk (YouTube)
