Test Designer

Design evaluation scenarios, datasets, and metrics that reveal real risks in regulated industry AI.

Role Overview

At Reins AI, Test Designers define what we measure and how we know systems are under control. You’ll craft evaluation scenarios for agentic and human‑in‑the‑loop workflows in regulated audit contexts, specify metrics (quality, suitability, reliability), and design edge‑case suites that surface risks before they hit production. You’ll translate domain requirements into structured test sets, golden traces, and scoring rubrics that our engineers can automate and scale.

Responsibilities

  • Partner with clients and domain SMEs to capture real‑world tasks and controls (audit procedures, review flows, sampling rules).
  • Design evaluation artifacts: scenarios, prompts, datasets, golden traces, answer keys, and error taxonomies.
  • Specify metric definitions and scoring logic (e.g., task success, adherence to procedure, factuality/hallucination controls, suitability); a minimal sketch follows this list.
  • Plan data strategies: sampling methods, synthetic data needs, and statistical power considerations.
  • Define edge‑case and stress tests (distribution shifts, ambiguity, incomplete docs, adversarial inputs).
  • Collaborate with Evaluation Implementation Engineers to operationalize tests into pipelines and dashboards.
  • Validate evaluator reliability against human judgments; iterate with clear pass/fail thresholds and acceptance criteria.
  • Document designs with unambiguous instructions, schemas, and expected behaviors for reproducibility.
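
For concreteness, here is a minimal sketch of what one such artifact might look like: a scenario spec with an answer key and a small scoring rubric. The schema and field names are illustrative assumptions, not a Reins AI format.

```python
# Illustrative scenario spec: inputs, an answer key, and a scoring rubric.
# All field names are hypothetical, shown only to make the idea concrete.
scenario = {
    "id": "audit-sampling-017",
    "task": "Select a sample of journal entries per the stated procedure.",
    "inputs": {"population_size": 1200, "procedure": "random, n >= 25"},
    "answer_key": {"min_sample_size": 25, "method": "random"},
}

def score(response: dict, key: dict) -> dict:
    """Score one response against the answer key; returns per-check results."""
    checks = {
        "sample_size_ok": response.get("sample_size", 0) >= key["min_sample_size"],
        "method_ok": response.get("method") == key["method"],
    }
    return {"checks": checks, "pass": all(checks.values())}

print(score({"sample_size": 30, "method": "random"}, scenario["answer_key"]))
# {'checks': {'sample_size_ok': True, 'method_ok': True}, 'pass': True}
```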

Qualifications

  • 4+ years in evaluation/test design, applied research, QA for ML/AI, or adjacent roles.
  • Strong ability to turn domain goals into measurable test plans and rubrics.
  • Comfort with basic statistics for evaluation (sampling, confidence intervals, inter‑rater agreement); a worked agreement example follows this list.
  • Excellent written communication—clear specs others can implement without guesswork.
  • Experience collaborating cross‑functionally with engineers and product/delivery.
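
As one concrete example of the statistics involved, the sketch below computes Cohen's kappa between a human grader and an automated evaluator. It is an illustrative calculation with made-up labels, not a prescribed method.

```python
from collections import Counter

def cohen_kappa(rater_a: list, rater_b: list) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum((freq_a[l] / n) * (freq_b[l] / n) for l in set(rater_a) | set(rater_b))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical data: human grader vs. automated evaluator on 8 test cases.
human = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass"]
auto  = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]
print(f"kappa = {cohen_kappa(human, auto):.2f}")  # kappa = 0.47
```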

Preferred Skills

  • Background in audit/finance or other regulated domains.
  • Familiarity with agentic systems, human‑in‑the‑loop workflows, and prompt/program design.
  • Exposure to synthetic data generation and/or golden‑trace authoring.
  • Ability to design evaluator reliability checks (e.g., human vs. automated agreement studies).

Engagement Details

  • 12 months, 50% time blended
  • October 1, 2025