Accepted at the Workshop on Agentic Software Engineering (AgenticSE), co-located with the ACM Conference on AI and Agentic Systems (CAIS 2026), San Jose, CA, May 26, 2026.
Marisa Ferrara Boston, Glen Hanson, Effi Georgala, JD Hudgens (Reins AI), and Heather Frase (Veraitech).
Most agentic systems enter production before they are reliable enough for task-level benchmarks to be meaningful. In audit, finance, healthcare, and legal services, where failures carry regulatory, financial, and reputational consequences, the standard validation path runs through capability benchmarks: accuracy on held-out sets, adversarial robustness, compliance against policy. But those benchmarks presuppose a system whose structural integration is already sound enough that its output reflects agent behavior. At early maturity that assumption fails: structural defects, not task-level errors, dominate the failure landscape, and they mask the very signal task-level monitors are built to detect. The standard instruments are not the right instruments yet.
This work asks a different question. When error detection isn't viable, what is monitoring for?
We present a monitoring and triage methodology that decomposes agentic system evaluation into three dimensions (quality, suitability, efficiency) at three monitoring scopes (within-run, cross-run, structural), and uses variance as a characterization signal rather than relying on aggregate means. Findings route through severity classification adapted from FMEA, concentrating human attention on the subset that actually warrants investigation. We evaluate the methodology on a synthetic testbed of 220 runs across 120 document bundles with controlled error injection, built with Simthetic and processed by an early-stage system with known integration defects.
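To illustrate why variance can serve as a characterization signal where an aggregate mean cannot, the following minimal sketch groups monitor scores by monitor and document bundle across repeated runs and labels each group by its spread. It is an assumption-laden illustration, not the implementation evaluated here: the function name `characterize`, the tuple layout, the score scale, and the `var_threshold` value are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pvariance


def characterize(scores, var_threshold=0.01):
    """Label each (monitor, bundle) group by the spread of its scores.

    scores: iterable of (monitor, bundle_id, score) tuples, one per run.
    var_threshold: hypothetical cutoff separating consistent behavior
    from variable behavior; a real deployment would calibrate it.
    """
    grouped = defaultdict(list)
    for monitor, bundle_id, score in scores:
        grouped[(monitor, bundle_id)].append(score)

    report = {}
    for key, values in grouped.items():
        m = mean(values)
        v = pvariance(values)  # population variance across repeated runs
        label = "variable" if v > var_threshold else "consistent"
        report[key] = (m, v, label)
    return report


# Toy data (not from the testbed): two monitors with identical means.
toy_scores = [
    ("monitor_a", "bundle_1", 0.5), ("monitor_a", "bundle_1", 0.5),
    ("monitor_a", "bundle_1", 0.5), ("monitor_a", "bundle_1", 0.5),
    ("monitor_b", "bundle_1", 0.0), ("monitor_b", "bundle_1", 1.0),
    ("monitor_b", "bundle_1", 0.0), ("monitor_b", "bundle_1", 1.0),
]
report = characterize(toy_scores)
```

In the toy data both monitors average 0.5, so a mean-only summary cannot separate them; the variance (0.0 versus 0.25) distinguishes behavior that repeats on every run from behavior that changes between runs.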
Three results emerge. Monitor scope determines failure type: within-run monitors surface deterministic stage defects, cross-run monitors surface stochastic integration consequences, and a structural monitor identifies an integration gap with perfect consistency. Injected task-level errors are indistinguishable from clean baselines, confirming that structural defects mask task-level signal. And deterministic triage routes 97 percent of findings to automated tracking, concentrating human investigation on the 2 percent that reflect variable system behavior, a 43x reduction in review volume.
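The routing idea can be sketched as a pure function of a finding's fields, so that the same finding always lands in the same queue. This is a simplified sketch under stated assumptions: the `Finding` fields, the queue names, and the single rule on `variable` are hypothetical, the FMEA-adapted severity classification is omitted because its details are not given here, and the counts below are toy values rather than the reported results.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    monitor: str
    scope: str      # "within_run", "cross_run", or "structural"
    variable: bool  # True if behavior differs across repeated runs


def route(finding):
    """Deterministic rule: no sampling and no model call in the loop."""
    if finding.variable:
        return "human_investigation"
    return "automated_tracking"


def review_reduction(findings):
    """Return per-queue counts and total findings per human-reviewed one."""
    counts = Counter(route(f) for f in findings)
    human = counts["human_investigation"]
    factor = len(findings) / human if human else None
    return counts, factor


# Toy data: nine consistent findings and one variable finding.
toy_findings = [Finding("monitor_a", "within_run", False)] * 9
toy_findings.append(Finding("monitor_b", "cross_run", True))
counts, factor = review_reduction(toy_findings)
```

With these toy inputs one finding in ten reaches a human reviewer, a factor of 10.0; the reduction factor is simply the total number of findings divided by the number routed to investigation.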
The takeaway is practical. For systems at this stage, effective monitoring begins with structural diagnosis, before error detection becomes viable at all. We propose a maturity-staging model in which monitoring transitions from structural characterization to error detection to reliability tracking as integration defects are resolved. The taxonomy, the variance-based scope characterization, and the severity model transfer across document-driven, multi-stage agentic workflows in regulated industries; the specific calibrations are domain-specific.
Deploy monitoring early. The first thing it finds is the most important thing to fix.




