Evaluation Implementation Engineer

Implement evaluators, pipelines, and dashboards that turn test designs into reliable, scalable monitoring for high-stakes audit/finance AI.

Role Overview

At Reins AI, Evaluation Implementation Engineers make evaluation real. You’ll turn Test Designer specs into production-grade evaluators and data pipelines; integrate with client stacks (Databricks, GCP/Azure); and wire outputs to observability platforms (LangSmith, OpenTelemetry, Arize) so product owners can trust what they ship. You’ll focus on correctness, reproducibility, and scale, treating evaluation as meaning-making, not just fault-finding.

Responsibilities

  • Implement evaluators and scoring logic (task success, factuality/hallucination, procedure-adherence via golden traces, suitability, reliability); see the evaluator sketch after this list.
  • Build dataflows for ingestion, sampling, and batch/stream evaluation; persist results with strong schemas for query and audit.
  • Orchestrate runs and backfills (e.g., Prefect/Airflow), containerize jobs (Docker), and automate CI/CD.
  • Integrate traces and metrics with LangSmith/OpenInference & OpenTelemetry; publish monitoring to Arize or equivalent.
  • Optimize runtime (parallelism, caching, vectorization) and cost while preserving statistical power.
  • Implement evaluator reliability checks (human vs. automated agreement, calibration) and guard against invalid or low-confidence scores.
  • Collaborate with Test Designers to refine specs; with Tech Lead to harden architecture; with Delivery to meet acceptance criteria.
  • Document interfaces, configs, and runbooks for repeatability and handoff.
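
As an illustration of the evaluator and persistence work above, here is a minimal sketch of a typed result record, a trivial exact-match evaluator, and a schema-backed write with PyArrow. The EvalResult fields, the exact_match_evaluator, and the Parquet layout are illustrative assumptions, not a prescribed Reins AI interface.

    # Minimal sketch: a typed, persistable evaluation result.
    # Field names and the Parquet layout are illustrative assumptions only.
    from dataclasses import dataclass, asdict
    from datetime import datetime, timezone

    import pyarrow as pa
    import pyarrow.parquet as pq


    @dataclass
    class EvalResult:
        run_id: str
        example_id: str
        evaluator: str
        score: float        # 0.0-1.0 task-success score
        passed: bool
        evaluated_at: str   # ISO-8601 timestamp for audit trails


    def exact_match_evaluator(run_id: str, example_id: str,
                              prediction: str, reference: str) -> EvalResult:
        """Score a single model output against a reference answer."""
        passed = prediction.strip().lower() == reference.strip().lower()
        return EvalResult(
            run_id=run_id,
            example_id=example_id,
            evaluator="exact_match",
            score=1.0 if passed else 0.0,
            passed=passed,
            evaluated_at=datetime.now(timezone.utc).isoformat(),
        )


    def persist_results(results: list[EvalResult], path: str) -> None:
        """Write results with an explicit schema so they stay queryable and auditable."""
        table = pa.Table.from_pylist([asdict(r) for r in results])
        pq.write_table(table, path)

A batch runner would call the evaluator per example and hand the accumulated list to persist_results; the explicit record type is what keeps downstream queries and audits stable.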

Qualifications

  • 4+ years in software/data/ML engineering with shipped production systems.
  • Strong Python (pandas, PyArrow/Polars, typing, testing) and REST API integration.
  • Experience building pipelines/orchestration (Prefect/Airflow, dbt a plus) and working in cloud (GCP or Azure) with object stores/warehouses.
  • Familiarity with observability for LLM/agent systems (LangSmith/OpenInference, OpenTelemetry) and ML monitoring (e.g., Arize).
  • Solid applied stats for evaluation: sampling, confidence intervals, inter-rater agreement, calibration (see the reliability-check sketch after this list).
  • Habit of writing clean, testable code; comfort with Docker, CI/CD, infra-as-code basics.
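
To give a concrete sense of the evaluator-reliability and applied-stats items above, here is a minimal sketch of two such checks: chance-corrected human-vs-automated agreement (Cohen's kappa) and a simple expected calibration error. The function names, binning, and any thresholds are assumptions for illustration only.

    # Illustrative reliability checks: agreement and calibration.
    # Names and binning choices are assumptions, not fixed requirements.
    import numpy as np


    def cohens_kappa(human: np.ndarray, automated: np.ndarray) -> float:
        """Chance-corrected agreement between binary human and automated labels."""
        observed = np.mean(human == automated)
        p_yes = np.mean(human) * np.mean(automated)
        p_no = (1 - np.mean(human)) * (1 - np.mean(automated))
        expected = p_yes + p_no
        return float((observed - expected) / (1 - expected))


    def expected_calibration_error(confidences: np.ndarray, correct: np.ndarray,
                                   n_bins: int = 10) -> float:
        """Average |confidence - accuracy| gap, weighted by bin population."""
        bins = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for i in range(n_bins):
            lo, hi = bins[i], bins[i + 1]
            if i == n_bins - 1:
                mask = (confidences >= lo) & (confidences <= hi)  # include 1.0 in last bin
            else:
                mask = (confidences >= lo) & (confidences < hi)
            if mask.any():
                ece += mask.mean() * abs(confidences[mask].mean() - correct[mask].mean())
        return float(ece)

In practice, a low kappa or a large calibration gap would be a signal to gate or down-weight the automated scores rather than publish them.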

Preferred Skills

  • Domain exposure to audit/finance or other regulated environments.
  • Experience implementing golden-trace evaluators, synthetic data harnesses, or red-team/edge-case suites (see the golden-trace sketch after this list).
  • Familiarity with vector DBs, retrieval/RAG evaluation, and latency/throughput tuning.
  • Security-minded development (PII handling, least privilege, key management).
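
For the golden-trace item above, the sketch below shows one plausible adherence check: confirming that an agent's tool-call sequence contains the golden steps in order, with extra steps allowed. The step names and the in-order-subsequence rule are hypothetical assumptions, not the actual evaluation spec.

    # Hedged sketch of a golden-trace procedure-adherence check.
    # Step names and the subsequence rule are illustrative assumptions.
    def adheres_to_golden_trace(agent_steps: list[str], golden_steps: list[str]) -> bool:
        """Return True if golden_steps appear in agent_steps in order (gaps allowed)."""
        it = iter(agent_steps)
        return all(step in it for step in golden_steps)


    # Example: the agent may take extra steps, but must hit the golden ones in order.
    golden = ["fetch_ledger", "reconcile_balances", "draft_findings"]
    trace = ["authenticate", "fetch_ledger", "query_policy",
             "reconcile_balances", "draft_findings", "notify_reviewer"]
    assert adheres_to_golden_trace(trace, golden)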

Engagement

  • 12 months, 100% time blended
  • October 1, 2025