Reliability Data Scientist

Design evaluation scenarios, datasets, and metrics that reveal real risks in regulated industry AI.

Role Overview

At Reins AI, data scientists define and operationalize how we measure reliability in real-world AI systems. You’ll bridge evaluation design and data analysis, crafting the test logic behind our reliability dashboards and weekly reports. Working across regulated audit and finance contexts, you’ll translate evaluation scenarios into structured metrics, visualizations, and summaries that help our clients see what’s working, what’s drifting, and what needs triage.

You’ll collaborate closely with our Solutions Architect and Reliability Lead to connect monitoring data (Grafana, LangSmith, Arize) with simulations and context-engineering workflows, building the analytical backbone of AI Ops reporting.

Responsibilities

  • Partner with domain and monitoring leads to define evaluation scenarios and metrics (quality, suitability, reliability).
  • Build and maintain evaluation datasets, golden traces, and error taxonomies.
  • Develop and maintain weekly reliability dashboards and summary reports (Grafana, Python, SQL, or notebooks).
  • Analyze evaluation results for drift, outliers, and context-dependent failures; flag issues for triage and verification loops.
  • Collaborate with engineers to automate scoring and aggregation pipelines.
  • Validate evaluator reliability and calibration against human judgments.
  • Document test logic, metric definitions, and interpretation guidance for repeatability.
  • Support context-engineering workflows by designing metrics that measure predictability, observability, and directability.

Qualifications

  • 3–6 years in data science, analytics, or ML evaluation roles.
  • Experience building dashboards and automated reports (Grafana, Power BI, or similar).
  • Strong Python, SQL, and data-wrangling skills.
  • Familiarity with evaluation design concepts (sampling, calibration, pass/fail criteria).
  • Excellent communication: can turn technical data into clear, decision-ready insights.

Preferred Skills

  • Background in AI system monitoring, LLM evaluation, or reliability engineering.
  • Familiarity with LangSmith, OpenInference, or similar tracing frameworks.
  • Experience with synthetic or simulated data analysis.
  • Understanding of regulated domains (audit, finance, healthcare).

Employment Details

This role begins as a 4–6 month contract engagement (20 hours/week) with a clear path to full-time employment as we finalize 2026 project scopes. We'll jointly evaluate fit, scope, and structure during that period.
Target start date: December 15, 2025