Evaluation & Monitoring System Architect (Agentic Systems & Simulation)

Architect evaluation, monitoring, and simulation systems (from frameworks to workflows to tooling) so AI teams can operate agentic architectures reliably at scale.

Role Overview

Reins AI is hiring an Evaluation & Monitoring System Architect to design the operating model for AI reliability in regulated domains. You’ll define the end-to-end architecture (agentic frameworks, evaluation data flows, observability integrations, scorecards, and release gates) and extend it into simulation: mirrored agentic environments that let us generate synthetic telemetry, run validators, and stress-test reliability loops before production.

You’ll partner with our Adaptation and Product teams to ensure every improvement ships with regression coverage, measurable reliability gains, and reusable MCP-style services for clients.

Responsibilities

  • System Architecture: Define the reference architecture for evaluation, monitoring, and agentic observability (ingest → evaluate → triage → verify → score → report).
  • Evaluator Frameworks: Standardize evaluator patterns (task success, hallucination/factuality, adherence to procedure, suitability/reliability) with well-defined APIs and regression tests (see the first sketch after this list).
  • Observability Integration: Integrate traces and metrics (LangSmith/OpenInference, OpenTelemetry, Arize, Grafana) with dashboards and SLOs for agentic and multi-agent systems (see the second sketch after this list).
  • Scorecards & Gates: Establish reliability KPIs (pass-rates, variance, MTTR, calibration) and “ready-to-ship” gates; automate backtests and regressions.
  • Workflows & Handoffs: Design triage queues, escalation paths, and ownership models so Delivery/Client teams can operate independently.
  • Governance: Define test-set stewardship (golden traces, thresholds, update cadence), versioning, change logs, and audit trails.
  • Enablement: Produce playbooks, runbooks, and quick-start guides; deliver internal/client training.
  • Partnerships: Work with Test Designers (what to measure), Implementation Engineers (how it runs), and Adaptation (what to change) to close the loop.
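
To make a few of these concrete, here is a minimal, illustrative Python sketch of the kind of evaluator contract and “ready-to-ship” gate this role would standardize. All names (EvalResult, run_suite, ready_to_ship) and the 95% pass-rate threshold are hypothetical placeholders, not an existing Reins AI API.

    from dataclasses import dataclass
    from statistics import mean
    from typing import Callable, Sequence

    @dataclass
    class EvalResult:
        case_id: str
        passed: bool
        score: float  # e.g., a task-success or factuality score in [0, 1]

    # An evaluator maps one test case (e.g., a golden trace) to a result.
    Evaluator = Callable[[dict], EvalResult]

    def run_suite(evaluator: Evaluator, cases: Sequence[dict]) -> list[EvalResult]:
        """Run a single evaluator over a regression suite of cases."""
        return [evaluator(case) for case in cases]

    def ready_to_ship(results: Sequence[EvalResult], min_pass_rate: float = 0.95) -> bool:
        """A release gate: block shipping when the suite pass rate falls below threshold."""
        pass_rate = mean(1.0 if r.passed else 0.0 for r in results)
        return pass_rate >= min_pass_rate

A companion sketch for the observability side, assuming the opentelemetry-api package is available: each evaluation result is emitted as a span with attributes so it can be joined with agent traces in downstream dashboards. The span and attribute names are illustrative, not an established convention.

    from opentelemetry import trace

    tracer = trace.get_tracer("evaluation")

    def record_eval_span(result: EvalResult) -> None:
        # One span per evaluated case, reusing EvalResult from the sketch above.
        with tracer.start_as_current_span("evaluation.case") as span:
            span.set_attribute("eval.case_id", result.case_id)
            span.set_attribute("eval.passed", result.passed)
            span.set_attribute("eval.score", result.score)

In the role itself you would define the real contracts, thresholds, and naming conventions; these placeholders only indicate the shape of the work.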

Qualifications

  • 6+ years in ML/AI evaluation, data/ML platform engineering, or reliability/observability architecture.
  • Strong Python + data engineering fundamentals; comfort with cloud (GCP/Azure), containers, CI/CD.
  • Expertise with monitoring and tracing tools (LangSmith, OpenTelemetry, Arize, Grafana).
  • Applied statistics for evaluation (sampling, confidence intervals, inter-rater agreement, calibration).
  • Excellent systems thinking and cross-functional communication.
  • Familiarity with multi-agent orchestration frameworks (LangGraph, Semantic Kernel, CrewAI, etc.).

Preferred Skills

  • Background in regulated domains (audit, finance, healthcare).
  • Experience with simulation or synthetic data generation.
  • Familiarity with MCP frameworks or plugin-based service architectures.
  • Understanding of agentic/HITL workflows and AI safety/reliability concerns.

Employment Details

This will start as a 4-6 month contract engagement (20 hours/week) with a clear path to full-time employment as we finalize 2026 project scopes. We’ll jointly evaluate fit, scope, and structure during that period.
Target start date: December 15, 2025.