Evaluation Architect

Architect the simulation, evaluation, and reliability systems (from frameworks to workflows to tooling) that let AI teams model, test, and operate complex agentic systems reliably at scale.

Role Overview

Reins AI is hiring an Evaluation & Monitoring Architect to design the operating model for AI reliability in regulated domains. You’ll define the end-to-end architecture (agentic frameworks, evaluation data flows, observability integrations, scorecards, and release gates) and extend that architecture into simulations: mirrored agentic environments that let us generate synthetic telemetry, run validators, and stress-test reliability loops before production.

You’ll partner with our Adaptation and Product teams to ensure every improvement ships with regression coverage, measurable reliability gains, and reusable MCP-style services for clients.

Responsibilities

  • System Architecture: Define the reference architecture for evaluation, monitoring, and agentic observability (ingest → evaluate → triage → verify → score → report).
  • Evaluator Frameworks: Standardize evaluator patterns (task success, hallucination/factuality, adherence to procedure, suitability/reliability) with well-defined APIs and regression tests (a brief illustrative sketch follows this list).
  • Observability Integration: Integrate traces and metrics (LangSmith/OpenInference, OpenTelemetry, Arize, Grafana) with dashboards and SLOs for agentic and multi-agent systems.
  • Scorecards & Gates: Establish reliability KPIs (pass rates, variance, MTTR, calibration) and “ready-to-ship” gates; automate backtests and regressions.
  • Workflows & Handoffs: Design triage queues, escalation paths, and ownership models so Delivery/Client teams can operate independently.
  • Governance: Define test-set stewardship (golden traces, thresholds, update cadence), versioning, change logs, and audit trails.
  • Enablement: Produce playbooks, runbooks, and quick-start guides; deliver internal/client training.
  • Partnerships: Work with Test Designers (what to measure), Implementation Engineers (how it runs), and Adaptation (what to change) to close the loop.
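
To make the evaluator-pattern responsibility concrete, here is a minimal sketch in Python, assuming a hypothetical Evaluator protocol, a toy TaskSuccessEvaluator, and a pass_rate_gate release check; the names are illustrative only, not an existing Reins AI or vendor API.

    from dataclasses import dataclass
    from typing import Protocol, Sequence


    @dataclass
    class EvalResult:
        case_id: str
        passed: bool
        score: float  # 0.0-1.0; meaning is evaluator-specific


    class Evaluator(Protocol):
        """Shared interface so evaluators (task success, factuality,
        adherence to procedure) stay interchangeable in pipelines
        and regression tests."""
        def evaluate(self, trace: dict) -> EvalResult: ...


    class TaskSuccessEvaluator:
        """Toy evaluator: a trace passes if its final state matches the
        expected outcome recorded alongside the golden trace."""
        def evaluate(self, trace: dict) -> EvalResult:
            passed = trace.get("final_state") == trace.get("expected_state")
            return EvalResult(trace["case_id"], passed, 1.0 if passed else 0.0)


    def pass_rate_gate(results: Sequence[EvalResult], threshold: float) -> bool:
        """Release gate: hold the ship decision if the pass rate over a
        golden test set falls below the agreed threshold."""
        rate = sum(r.passed for r in results) / max(len(results), 1)
        return rate >= threshold


    if __name__ == "__main__":
        golden_traces = [
            {"case_id": "t1", "final_state": "filed", "expected_state": "filed"},
            {"case_id": "t2", "final_state": "draft", "expected_state": "filed"},
        ]
        results = [TaskSuccessEvaluator().evaluate(t) for t in golden_traces]
        print("ship" if pass_rate_gate(results, threshold=0.9) else "hold")

In practice each evaluator would also carry its own versioned thresholds and regression suite, in line with the governance responsibilities above.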

Qualifications

  • 6+ years in ML/AI evaluation, data/ML platform, or reliability/observability architecture.
  • Strong Python + data engineering fundamentals; comfort with cloud (GCP/Azure), containers, CI/CD.
  • Expertise with monitoring and tracing tools (LangSmith, OpenTelemetry, Arize, Grafana).
  • Applied statistics for evaluation (sampling, confidence intervals, inter-rater agreement, calibration).
  • Excellent systems thinking and cross-functional communication.
  • Familiarity with multi-agent orchestration frameworks (LangGraph, Semantic Kernel, CrewAI, etc.).

Preferred Skills

  • Background in regulated domains (audit, finance, healthcare).
  • Experience with simulation or synthetic data generation.
  • Familiarity with MCP frameworks or plugin-based service architectures.
  • Understanding of agentic/HITL workflows and AI safety/reliability concerns.

Employment Details

This will start as a 4-6 month contract engagement (20 hours/week) with a clear path to full-time employment as we finalize 2026 project scopes. We’ll jointly evaluate fit, scope, and structure during that period.
Optimal start date: December 15, 2025