Evaluation Implementation Engineer

Implement evaluators, pipelines, and dashboards that turn test designs into reliable, scalable monitoring for high-stakes audit/finance AI.

Role Overview

At Reins AI, Evaluation Implementation Engineers make evaluation real. You’ll turn Test Designer specs into production-grade evaluators and data pipelines; integrate with client stacks (Databricks, GCP/Azure); and wire outputs to observability platforms (LangSmith, OpenTelemetry, Arize) so product owners can trust what they ship. You’ll focus on correctness, reproducibility, and scale, treating evaluation as meaning-making, not just fault-finding.

Responsibilities

  • Implement evaluators and scoring logic (task success, factuality/hallucination, procedure-adherence via golden traces, suitability, reliability); see the evaluator sketch after this list.
  • Build dataflows for ingestion, sampling, and batch/stream evaluation; persist results with strong schemas for query and audit.
  • Orchestrate runs and backfills (e.g., Prefect/Airflow), containerize jobs (Docker), and automate CI/CD.
  • Integrate traces and metrics with LangSmith/OpenInference & OpenTelemetry; publish monitoring to Arize or equivalent.
  • Optimize runtime (parallelism, caching, vectorization) and cost while preserving statistical power.
  • Implement evaluator reliability checks (human vs. automated agreement, calibration) and guard against invalid or low-confidence scores.
  • Collaborate with Test Designers to refine specs; with Tech Lead to harden architecture; with Delivery to meet acceptance criteria.
  • Document interfaces, configs, and runbooks for repeatability and handoff.
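
As an illustration of the evaluator and persistence work above, here is a minimal sketch of a typed result record, a trivial exact-match evaluator, and a schema-backed write with PyArrow. The EvalResult fields, the exact_match_evaluator, and the Parquet layout are illustrative assumptions, not a prescribed Reins AI interface.

    # Minimal sketch: a typed, persistable evaluation result.
    # Field names and the Parquet layout are illustrative assumptions only.
    from dataclasses import dataclass, asdict
    from datetime import datetime, timezone

    import pyarrow as pa
    import pyarrow.parquet as pq


    @dataclass
    class EvalResult:
        run_id: str
        example_id: str
        evaluator: str
        score: float        # 0.0-1.0 task-success score
        passed: bool
        evaluated_at: str   # ISO-8601 timestamp for audit trails


    def exact_match_evaluator(run_id: str, example_id: str,
                              prediction: str, reference: str) -> EvalResult:
        """Score a single model output against a reference answer."""
        passed = prediction.strip().lower() == reference.strip().lower()
        return EvalResult(
            run_id=run_id,
            example_id=example_id,
            evaluator="exact_match",
            score=1.0 if passed else 0.0,
            passed=passed,
            evaluated_at=datetime.now(timezone.utc).isoformat(),
        )


    def persist_results(results: list[EvalResult], path: str) -> None:
        """Write results with an explicit schema so they stay queryable and auditable."""
        table = pa.Table.from_pylist([asdict(r) for r in results])
        pq.write_table(table, path)

A batch runner would call the evaluator per example and hand the accumulated list to persist_results; the explicit record type is what keeps downstream queries and audits stable.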

Qualifications

  • 4+ years in software/data/ML engineering with shipped production systems.
  • Strong Python (pandas, PyArrow/Polars, typing, testing) and REST API integration.
  • Experience building pipelines/orchestration (Prefect/Airflow, dbt a plus) and working in cloud (GCP or Azure) with object stores/warehouses.
  • Familiarity with observability for LLM/agent systems (LangSmith/OpenInference, OpenTelemetry) and ML monitoring (e.g., Arize).
  • Solid applied stats for evaluation: sampling, confidence intervals, inter-rater agreement, calibration (see the reliability-check sketch after this list).
  • Habit of writing clean, testable code; comfort with Docker, CI/CD, infra-as-code basics.
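
To give a concrete sense of the evaluator-reliability and applied-stats items above, here is a minimal sketch of two such checks: chance-corrected human-vs-automated agreement (Cohen's kappa) and a simple expected calibration error. The function names, binning, and any thresholds are assumptions for illustration only.

    # Illustrative reliability checks: agreement and calibration.
    # Names and binning choices are assumptions, not fixed requirements.
    import numpy as np


    def cohens_kappa(human: np.ndarray, automated: np.ndarray) -> float:
        """Chance-corrected agreement between binary human and automated labels."""
        observed = np.mean(human == automated)
        p_yes = np.mean(human) * np.mean(automated)
        p_no = (1 - np.mean(human)) * (1 - np.mean(automated))
        expected = p_yes + p_no
        return float((observed - expected) / (1 - expected))


    def expected_calibration_error(confidences: np.ndarray, correct: np.ndarray,
                                   n_bins: int = 10) -> float:
        """Average |confidence - accuracy| gap, weighted by bin population."""
        bins = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for i in range(n_bins):
            lo, hi = bins[i], bins[i + 1]
            if i == n_bins - 1:
                mask = (confidences >= lo) & (confidences <= hi)  # include 1.0 in last bin
            else:
                mask = (confidences >= lo) & (confidences < hi)
            if mask.any():
                ece += mask.mean() * abs(confidences[mask].mean() - correct[mask].mean())
        return float(ece)

In practice, a low kappa or a large calibration gap would be a signal to gate or down-weight the automated scores rather than publish them.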

Preferred Skills

  • Domain exposure to audit/finance or other regulated environments.
  • Experience implementing golden-trace evaluators, synthetic data harnesses, or red-team/edge-case suites (see the golden-trace sketch after this list).
  • Familiarity with vector DBs, retrieval/RAG evaluation, and latency/throughput tuning.
  • Security-minded development (PII handling, least privilege, key management).
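
For the golden-trace item above, the sketch below shows one plausible adherence check: confirming that an agent's tool-call sequence contains the golden steps in order, with extra steps allowed. The step names and the in-order-subsequence rule are hypothetical assumptions, not the actual evaluation spec.

    # Hedged sketch of a golden-trace procedure-adherence check.
    # Step names and the subsequence rule are illustrative assumptions.
    def adheres_to_golden_trace(agent_steps: list[str], golden_steps: list[str]) -> bool:
        """Return True if golden_steps appear in agent_steps in order (gaps allowed)."""
        it = iter(agent_steps)
        return all(step in it for step in golden_steps)


    # Example: the agent may take extra steps, but must hit the golden ones in order.
    golden = ["fetch_ledger", "reconcile_balances", "draft_findings"]
    trace = ["authenticate", "fetch_ledger", "query_policy",
             "reconcile_balances", "draft_findings", "notify_reviewer"]
    assert adheres_to_golden_trace(trace, golden)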

Engagement

  • 12 months, 100% time blended
  • October 1, 2025