Evaluation & Monitoring System Architect

Architect evaluation & monitoring systems (from frameworks to workflows to tooling) so AI teams can operate them reliably at scale.

Role Overview

Reins AI is hiring an Evaluation & Monitoring Architect to design the operating model for AI evaluation in regulated audit/finance. You’ll define the end-to-end architecture (evaluator frameworks, data flows, observability integrations, scorecards, release gates), and then package it into runbooks, dashboards, and training so other teams can run it. You’ll partner with our Adaptation lead to ensure every improvement ships with regression coverage and measurable reliability gains.

Responsibilities

  • System Architecture: Define the evaluation/monitoring reference architecture (ingest → evaluate → score → report), schemas, and interfaces.
  • Evaluator Frameworks: Standardize evaluator patterns (task success, hallucination/factuality, procedure adherence via golden traces, suitability/reliability) with clear APIs and tests (see the sketch after this list).
  • Observability Integration: Integrate traces/metrics (LangSmith/OpenInference, OpenTelemetry) and monitoring (e.g., Arize) with dashboards and SLOs.
  • Scorecards & Gates: Establish reliability KPIs (pass-rates, variance, MTTR, calibration) and “ready-to-ship” gates; automate backtests and regressions.
  • Workflows & Handoffs: Design triage queues, escalation paths, and ownership models so Delivery/Client teams can operate independently.
  • Governance: Define test-set stewardship (golden traces, thresholds, update cadence), versioning, change logs, and audit trails.
  • Enablement: Produce playbooks, runbooks, and quick-start guides; deliver internal/client training.
  • Partnerships: Work with Test Designers (what to measure), Implementation Engineers (how it runs), and Adaptation (what to change) to close the loop.
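
For concreteness, here is a minimal sketch of the kind of evaluator interface and pass-rate release gate these responsibilities imply. All names here (`Evaluator`, `EvalResult`, `release_gate`, the 0.95/0.98 thresholds) are illustrative assumptions, not an existing Reins AI API:

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class EvalResult:
    evaluator: str     # which evaluator produced this result
    passed: bool       # binary outcome used for pass-rate KPIs
    score: float       # graded score in [0, 1] for variance/calibration
    detail: str = ""   # human-readable explanation for triage queues

class Evaluator(Protocol):
    """Common interface every evaluator pattern implements."""
    name: str
    def evaluate(self, trace: dict) -> EvalResult: ...

class GoldenTraceAdherence:
    """Procedure adherence: compare an agent trace against a golden trace."""
    name = "procedure_adherence"

    def __init__(self, golden_steps: list[str]):
        self.golden_steps = golden_steps

    def evaluate(self, trace: dict) -> EvalResult:
        observed = [step["name"] for step in trace.get("steps", [])]
        # Count golden steps that appear, in order, in the observed trace.
        matched, idx = 0, 0
        for step in observed:
            if idx < len(self.golden_steps) and step == self.golden_steps[idx]:
                matched += 1
                idx += 1
        score = matched / len(self.golden_steps) if self.golden_steps else 1.0
        detail = f"{matched}/{len(self.golden_steps)} golden steps matched in order"
        return EvalResult(self.name, score >= 0.95, score, detail)

def release_gate(results: list[EvalResult], min_pass_rate: float = 0.98) -> bool:
    """'Ready-to-ship' gate: block release if the pass rate drops below threshold."""
    pass_rate = sum(r.passed for r in results) / len(results)
    return pass_rate >= min_pass_rate
```

In practice, each evaluator's results would also flow into the tracing/monitoring stack (LangSmith/OpenInference, OpenTelemetry, Arize) so the same scores feed dashboards, scorecards, and automated backtests.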

Qualifications

  • 6+ years in ML/AI evaluation, data/ML platform, or reliability/observability architecture.
  • Strong Python + data engineering fundamentals; comfort with cloud (GCP/Azure), containers, CI/CD.
  • Experience with tracing/metrics/monitoring for LLM/agent systems (e.g., LangSmith/OpenInference, OpenTelemetry, Arize).
  • Applied statistics for evaluation (sampling, confidence intervals, inter-rater agreement, calibration); see the sketch after this list.
  • Excellent systems thinking, documentation, and stakeholder communication.
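
To make the statistics bullet concrete, here is a small sketch of a Wilson score interval on an eval-suite pass rate, the kind of calculation behind deciding whether a regression is real or sampling noise. The function name and example numbers are illustrative, assuming plain Python with no external dependencies:

```python
import math

def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial pass rate.

    More reliable than the naive normal approximation at small n or
    extreme pass rates, which is exactly where eval suites tend to live.
    """
    if n == 0:
        return (0.0, 1.0)
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - margin, center + margin)

# Example: 46/50 passes looks like 92%, but the interval is wide.
low, high = wilson_interval(46, 50)
print(f"pass rate 92%, 95% CI: [{low:.1%}, {high:.1%}]")  # roughly [81%, 97%]
```

The same machinery extends to inter-rater agreement (e.g., Cohen's kappa across human graders) and calibration checks on graded evaluator scores.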

Preferred Skills

  • Audit/finance or other highly regulated domain experience.
  • Familiarity with agentic/HITL workflows, RAG evaluation, and synthetic data needs.
  • Security/privacy awareness (PII handling, access controls).

October 1, 2025