Agent Reliability Lead

Design the triage→remediation reliability system (including frameworks, workflows, dashboards, and gates) so delivery teams can run it at scale.

Role Overview

At Reins AI, the Agent Reliability Lead architects the closed-loop reliability program for agentic and human-in-the-loop systems in regulated audit/finance contexts. You’ll design the operating model, including the triage taxonomy, RCA workflows, remediation playbooks, regression coverage, dashboards, and release gates. You’ll then package it into tools, runbooks, and training so other teams can operate it. You’ll convert evaluator signals into a durable system that prioritizes fixes, ships improvements, and verifies impact with clear metrics.

Responsibilities

  • Program Architecture: Define the end-to-end reliability framework (triage taxonomy, severity/risk model, RCA patterns, remediation types, regression tests, release gates).
  • Workflow & Tooling Design: Specify dashboards, queues, handoffs, and automation (signals→ticketing, backtests, acceptance criteria) so squads can run the loop independently.
  • Metrics & Scorecards: Establish reliability KPIs (failure classes, MTTR, pass-rate deltas, variance caps, calibration) and reporting cadences.
  • Governance: Set test-set stewardship (golden traces, thresholds, update cadence) and change-management standards (release notes, change logs, audit trails).
  • Enablement & Handoff: Produce playbooks, runbooks, and quick-start guides; deliver training so Delivery/Client teams can execute without shadowing.
  • Partnerships: Align with the Evaluation Design Lead (what to measure) and Engineering (how it runs) to ensure fixes come with regression coverage and clear acceptance criteria.
  • Continuous Improvement: Review outcomes, refine

Qualifications

  • 6+ years in reliability/quality/program leadership for ML/AI or complex software (QA Lead, SRE/ML, Program Manager).
  • Track record designing operational systems that others run (process + tooling + metrics).
  • Fluency with evaluation/observability concepts and basic stats (sampling, CIs, calibration, agreement).
  • Excellent systems thinking, documentation, and stakeholder management in regulated contexts.

Preferred Skills

  • Audit/finance or other highly regulated domain experience.
  • Familiarity with agentic/HITL workflows, golden traces, and test governance.
  • Comfort reading API specs/data schemas; light Python/SQL; issue tracking (Linear/Jira).

12 months, 50% blended.
October 1, 2025