Role Overview
The Principal Simulation & Reliability Architect will lead the design of modular simulation environments, reliability tooling, and observability patterns that help teams understand and improve multi-step agentic AI workflows. This role is both architectural and hands-on: you will prototype internal tools, establish foundational patterns, and collaborate closely with the founder, data scientist, and synthetic data teams.
Responsibilities
- Design modular simulation environments for multi-step agent workflows and decision policies.
- Model interactions among agents, tools, and document flows to surface behavior and failure modes.
- Define evaluation patterns for agentic systems (task success, factuality, procedure adherence, suitability).
- Build regression, validation, and inspection tooling for simulation outputs.
- Identify and instrument key events and metrics for monitoring, triage, and investigation workflows.
- Integrate simulations with modern observability tooling (OpenTelemetry, Arize, Grafana).
- Develop trace schemas and system health signals to support reliability insights.
- Establish architectural patterns and internal frameworks for future engineering hires.
- Contribute to the roadmap and technical foundations of Reins AI’s simulation and reliability platform.
Qualifications
- 6+ years of experience architecting or building complex ML, simulation, workflow, or observability systems.
- Strong Python engineering fundamentals and experience developing internal tooling or frameworks.
- Ability to design abstractions and end-to-end technical architectures.
- Familiarity with multi-step AI workflows or agentic patterns (any framework).
- Strong debugging intuition and systems-thinking mindset.
- Excellent communication skills and comfort working in a fast-moving, founder-led environment.
Preferred Skills
- Experience with simulation frameworks, synthetic data workflows, or agentic evaluation.
- Background in reliability engineering, monitoring, or triage system design.
- Exposure to regulated domains (audit, finance, healthcare).
- Knowledge of distributed systems or ML pipeline design.
- Experience with observability tooling (OpenTelemetry, Arize, Grafana, Datadog).
- Familiarity with agentic frameworks such as LangGraph, Semantic Kernel, or CrewAI.
Employment Details
This role will start as a 4- to 6-month contract engagement (20 hours/week), with a clear path to full-time employment as we finalize 2026 project scopes. We’ll jointly evaluate fit, scope, and structure during that period.
Optimal start date: December 19, 2025