Principal Simulation & Reliability Architect

Architect simulation, evaluation, and reliability systems (from frameworks to workflows to tooling) so AI teams can model, test, and operate complex agentic architectures reliably at scale.

Role Overview

The Principal Simulation & Reliability Architect will lead the design of modular simulation environments, reliability tooling, and observability patterns that help teams understand and improve multi-step agentic AI workflows. This role is both architectural and hands-on: you will prototype internal tools, establish foundational patterns, and collaborate closely with the founder, the data scientist, and the synthetic data teams.

Responsibilities

  • Design modular simulation environments for multi-step agent workflows and decision policies.
  • Model interactions among agents, tools, and document flows to surface behavior and failure modes.
  • Define evaluation patterns for agentic systems (task success, factuality, procedure adherence, suitability).
  • Build regression, validation, and inspection tooling for simulation outputs.
  • Identify and instrument key events and metrics for monitoring, triage, and investigation workflows.
  • Integrate simulations with modern observability tooling (OpenTelemetry, Arize, Grafana).
  • Develop trace schemas and system health signals to support reliability insights.
  • Establish architectural patterns and internal frameworks for future engineering hires.
  • Contribute to the roadmap and technical foundations of Reins AI’s simulation and reliability platform.

Qualifications

  • 6+ years architecting or building complex ML, simulation, workflow, or observability systems.
  • Strong Python engineering fundamentals and experience developing internal tooling or frameworks.
  • Ability to design abstractions and end-to-end technical architectures.
  • Familiarity with multi-step AI workflows or agentic patterns (any framework).
  • Strong debugging intuition and systems-thinking mindset.
  • Excellent communication skills and comfort working in a fast-moving, founder-led environment.

Preferred Skills

  • Experience with simulation frameworks, synthetic data workflows, or agentic evaluation.
  • Background in reliability engineering, monitoring, or triage system design.
  • Exposure to regulated domains (audit, finance, healthcare).
  • Knowledge of distributed systems or ML pipeline design.
  • Experience with observability tooling (OpenTelemetry, Arize, Grafana, Datadog).
  • Familiarity with agentic frameworks such as LangGraph, Semantic Kernel, or CrewAI.

Employment Details

This role will start as a 4-6 month contract engagement (20 hours/week), with a clear path to full-time employment as we finalize 2026 project scopes. We’ll jointly evaluate fit, scope, and structure during that period.
Optimal start date: December 19, 2025