
AI Evaluation and QA Engineer

FreelanceJobs
CA
Posted March 11, 2026

Job Description

We are building production AI agents for an enterprise client. These agents make real business decisions.

When they get something wrong, there are commercial consequences. We need someone who owns the quality layer.

You build test harnesses, red-team test cases, supervisor scoring logic, and release gates.

You work closely with the architect on the supervisor framework and with the engineers to make sure every agent meets acceptance criteria before it ships.

This is not traditional software QA. You are evaluating LLM-powered outputs for accuracy, policy compliance, tone, and completeness.

You are building the automated and human review loops that determine whether an agent's output is good enough to go live.

Responsibilities

  • Design and implement evaluation frameworks for AI agent outputs (accuracy, policy compliance, tone, completeness)
  • Build automated test harnesses that run on every deployment
  • Create red-team test cases: adversarial inputs, edge cases, policy boundary tests
  • Implement scoring logic inside the supervisor layer (confidence thresholds, pass/fail criteria)
  • Define and enforce release gates: what must pass before an agent goes live
  • Work with engineers to build rework and repair loops for failed outputs
  • Document evaluation results and quality trends for milestone reviews
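As a rough illustration of the supervisor scoring responsibility above (confidence thresholds, pass/fail criteria, routing to human review), a minimal sketch might look like the following; every name and threshold here is an illustrative assumption, not a specific from this posting:

```python
from dataclasses import dataclass

# Hypothetical supervisor gate: field names and thresholds are
# illustrative assumptions, not details from the job posting.

@dataclass
class Evaluation:
    accuracy: float         # 0.0-1.0, from an automated accuracy check
    policy_compliant: bool  # hard gate: any violation fails the output
    confidence: float       # scorer confidence in the result, 0.0-1.0

PASS_THRESHOLD = 0.9    # at or above: ship automatically
REVIEW_THRESHOLD = 0.7  # between thresholds: route to a human reviewer

def gate(ev: Evaluation) -> str:
    """Return 'pass', 'human_review', or 'fail' for one agent output."""
    if not ev.policy_compliant:
        return "fail"  # policy compliance is a hard release gate
    score = min(ev.accuracy, ev.confidence)  # worst dimension decides
    if score >= PASS_THRESHOLD:
        return "pass"
    if score >= REVIEW_THRESHOLD:
        return "human_review"
    return "fail"

print(gate(Evaluation(accuracy=0.95, policy_compliant=True, confidence=0.92)))  # pass
```

Taking the minimum across dimensions is one conservative design choice; a weighted score or per-dimension gates would be equally plausible.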

Required

  • 5+ years in QA, testing, or evaluation roles for AI/ML systems (not traditional software QA only)
  • Experience evaluating LLM outputs: hallucination detection, factual accuracy, policy compliance
  • Ability to write Python scripts for automated evaluation pipelines
  • Understanding of confidence scoring, thresholding, and human-in-the-loop review patterns
  • Experience with structured output validation and schema checking
  • Familiarity with red-teaming or adversarial testing for generative AI
  • Strong analytical and written communication skills
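The structured output validation and schema checking mentioned above could be sketched as follows, assuming the agent emits JSON; the field names are hypothetical and the stdlib-only check stands in for a fuller schema library:

```python
import json

# Hypothetical required schema for an agent's structured output;
# field names and types are illustrative assumptions.
REQUIRED_FIELDS = {"title": str, "summary": str, "sources": list}

def validate_output(raw: str) -> list[str]:
    """Return a list of validation errors; an empty list means it passes."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(doc, dict):
        return ["top-level value must be an object"]
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in doc:
            errors.append(f"missing field: {field}")
        elif not isinstance(doc[field], expected):
            errors.append(f"wrong type for {field}: expected {expected.__name__}")
    return errors

good = '{"title": "Q3 Report", "summary": "Revenue up 4%.", "sources": ["crm"]}'
print(validate_output(good))  # []
```

In practice a declarative validator (e.g. a JSON Schema library) would replace the hand-rolled type checks, but the shape of the release gate — parse, check structure, collect errors — stays the same.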

Nice to have

  • Experience with eval frameworks (Ragas, DeepEval, or custom evaluation suites)
  • Background in regulated or high-stakes domains (legal, financial, healthcare) where AI accuracy matters
  • Experience building audit trails and compliance reporting for AI systems
  • Familiarity with agent-specific failure modes (tool-calling errors, state corruption, cascading failures)

Screening question:

How would you design an evaluation framework for an AI agent that generates structured business documents (proposals, reports, or summaries) from multiple data sources? What metrics would you track, how would you catch factual errors or hallucinated content, and at what point would you escalate to a human reviewer?

Contract duration: 3 to 6 months, 40 hours per week.

Mandatory skills:

Artificial Intelligence, Software Testing, Bug Reports
