Job Description
We are building production AI agents for an enterprise client. These agents make real business decisions.
When they get something wrong, there are commercial consequences. We need someone who owns the quality layer.
You build test harnesses, red-team test cases, supervisor scoring logic, and release gates.
You work closely with the architect on the supervisor framework and with the engineers to make sure every agent meets acceptance criteria before it ships.
This is not traditional software QA. You are evaluating LLM-powered outputs for accuracy, policy compliance, tone, and completeness.
You are building the automated and human review loops that determine whether an agent's output is good enough to go live.
Responsibilities
- Design and implement evaluation frameworks for AI agent outputs (accuracy, policy compliance, tone, completeness)
- Build automated test harnesses that run on every deployment
- Create red-team test cases: adversarial inputs, edge cases, policy boundary tests
- Implement scoring logic inside the supervisor layer (confidence thresholds, pass/fail criteria; see the sketch after this list)
- Define and enforce release gates: what must pass before an agent goes live
- Work with engineers to build rework and repair loops for failed outputs
- Document evaluation results and quality trends for milestone reviews
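To make the scoring-logic responsibility concrete, here is a minimal sketch of a supervisor-layer pass/fail gate, assuming per-dimension scores in [0, 1] have already been produced by upstream evaluators. All names (THRESHOLDS, EvalResult, score_output) and the threshold values are illustrative, not an existing framework:

```python
# Minimal sketch of supervisor scoring logic: per-dimension scores are
# compared against configurable thresholds to yield a pass/fail verdict.
# Names and threshold values are hypothetical, for illustration only.
from dataclasses import dataclass

# One threshold per evaluation dimension; an output must clear all of them.
THRESHOLDS = {
    "accuracy": 0.90,
    "policy_compliance": 0.99,
    "tone": 0.80,
    "completeness": 0.85,
}

@dataclass
class EvalResult:
    passed: bool
    failures: dict  # dimension -> (score, threshold) for each failed check

def score_output(scores: dict[str, float]) -> EvalResult:
    """Apply pass/fail criteria: every dimension must meet its threshold."""
    failures = {
        dim: (scores.get(dim, 0.0), threshold)
        for dim, threshold in THRESHOLDS.items()
        if scores.get(dim, 0.0) < threshold
    }
    return EvalResult(passed=not failures, failures=failures)

if __name__ == "__main__":
    # Example: a compliant but incomplete output fails the gate.
    result = score_output(
        {"accuracy": 0.95, "policy_compliance": 1.0, "tone": 0.9, "completeness": 0.7}
    )
    print(result)  # passed=False, failures={'completeness': (0.7, 0.85)}
```

In practice the failed dimensions would feed the rework and repair loops described above, rather than simply blocking the release.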
Required:
- 5+ years in QA, testing, or evaluation roles for AI/ML systems (not traditional software QA only)
- Experience evaluating LLM outputs: hallucination detection, factual accuracy, policy compliance
- Ability to write Python scripts for automated evaluation pipelines
- Understanding of confidence scoring, thresholding, and human-in-the-loop review patterns
- Experience with structured output validation and schema checking (see the sketch after this list)
- Familiarity with red-teaming or adversarial testing for generative AI
- Strong analytical and written communication skills
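As a reference point for the schema-checking requirement, here is a minimal sketch using the third-party jsonschema package to validate an agent's JSON output before it reaches a release gate. The schema itself (a stripped-down "proposal" document) is hypothetical; a real agent would carry a far stricter contract:

```python
# Minimal sketch of structured output validation: parse the model's raw
# JSON and check it against a schema before the release gate.
# The PROPOSAL_SCHEMA below is illustrative, not a real agent contract.
import json
from jsonschema import Draft202012Validator

PROPOSAL_SCHEMA = {
    "type": "object",
    "required": ["title", "summary", "line_items"],
    "properties": {
        "title": {"type": "string", "minLength": 1},
        "summary": {"type": "string", "minLength": 1},
        "line_items": {
            "type": "array",
            "minItems": 1,
            "items": {
                "type": "object",
                "required": ["description", "amount"],
                "properties": {
                    "description": {"type": "string"},
                    "amount": {"type": "number", "minimum": 0},
                },
            },
        },
    },
    "additionalProperties": False,
}

def validate_output(raw: str) -> list[str]:
    """Return a list of schema violations (empty means the output is valid)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    validator = Draft202012Validator(PROPOSAL_SCHEMA)
    return [error.message for error in validator.iter_errors(data)]
```

Returning all violations at once, rather than failing on the first, is what lets an automated repair loop attempt a targeted fix.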
Nice to have:
- Experience with eval frameworks (Ragas, DeepEval, or custom evaluation suites)
- Background in regulated or high-stakes domains (legal, financial, healthcare) where AI accuracy matters
- Experience building audit trails and compliance reporting for AI systems
- Familiarity with agent-specific failure modes (tool-calling errors, state corruption, cascading failures)
Screening question:
How would you design an evaluation framework for an AI agent that generates structured business documents (proposals, reports, or summaries) from multiple data sources? What metrics would you track, how would you catch factual errors or hallucinated content, and at what point would you escalate to a human reviewer?
Contract duration: 3 to 6 months, 40 hours per week.
Mandatory skills:
Artificial Intelligence, Software Testing, Bug Reports