
Build Reliable LLM Generation + Validation Pipeline

FreelanceJobs
CA · Posted March 7, 2026

Job Description

We have an early working V3 of an AI-powered coding challenge ("mission") generator.

The system generates coding missions, starter code, and tests using LLM agents, and validates them with Judge0 to ensure starter code fails all tests and missions are executable/self-contained.

Supabase/Postgres stores missions and user profiles.

We are looking for a freelancer with experience in LLM systems, AI pipelines, or backend architecture to either:

  • Improve and productionize the existing codebase, or
  • Build a new, robust generation pipeline from scratch
The system must support both technical and non-technical users:

  • Technical users: receive missions with starter code and tests in any programming language; evaluated via our self-hosted Judge0 instance
  • Non-technical users: receive multiple-choice or textual questions only; AI handles grading automatically

All functionality must be accessible via API endpoints, including mission/question generation and grading.

Key Project Deliverables

  • Generation Pipeline – Refactored or new pipeline with clear interfaces, modular design, and API-first endpoints
  • Validation Gates – Starter code must fail all tests reliably using static and runtime checks
  • Deduplication System – Prevent exact or semantic duplicates per user (works for code and questions)
  • Job Orchestration – Retry, revision, timeouts, and optional mission pool for async generation to reduce latency
  • Observability & Metrics – Structured logs, latency breakdown, token/cost tracking, failure diagnostics
  • AI Grading Layer – Automatically grade non-technical users' submissions
  • Support Any Language – For technical missions; system must integrate with Judge0 or equivalent sandboxed execution
  • Documentation & Runbooks – Short guides explaining how to run, monitor, and extend the system
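The "starter code must fail all tests" gate can be sketched as a pure function over a pluggable test runner. This is a sketch under assumptions: in production the runner would submit code and test to Judge0 and inspect the verdict, while injecting it here keeps the gate deterministic and unit-testable:

```python
from typing import Callable, List

# run_test(starter_code, test_source) -> True iff the test PASSES.
# A real implementation would create a Judge0 submission and check its status;
# the callable signature here is an assumption, not the project's actual API.
TestRunner = Callable[[str, str], bool]

def starter_fails_all_tests(starter_code: str, tests: List[str],
                            run_test: TestRunner) -> bool:
    """Accept a mission only if every test fails against the starter code."""
    if not tests:
        return False  # no tests means nothing was verified: reject
    return all(not run_test(starter_code, t) for t in tests)
```

Treating an empty test list as a failure closes an easy loophole: a mission with no tests would otherwise pass the gate vacuously.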

Endpoints (API Design) – Suggested endpoints include:

  • /generate-mission → technical mission generation
  • /grade-submission → code grading via Judge0
  • /generate-assessment → non-technical questions only
  • /grade-assessment → AI grading of answers
  • /mission-pool-status → optional async pool info
  • /metrics → observability data
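One way to sketch the endpoint surface is a plain dispatch table that a web framework layer would wrap. The handler names and payload keys below are hypothetical placeholders, not a committed API contract (`/mission-pool-status` and `/metrics` are omitted for brevity):

```python
# Placeholder handlers for the suggested endpoints.
def generate_mission(payload: dict) -> dict:
    return {"endpoint": "/generate-mission", "track": "technical", **payload}

def grade_submission(payload: dict) -> dict:
    return {"endpoint": "/grade-submission", "graded_by": "judge0", **payload}

def generate_assessment(payload: dict) -> dict:
    return {"endpoint": "/generate-assessment", "track": "non_technical", **payload}

def grade_assessment(payload: dict) -> dict:
    return {"endpoint": "/grade-assessment", "graded_by": "llm", **payload}

ROUTES = {
    "/generate-mission": generate_mission,
    "/grade-submission": grade_submission,
    "/generate-assessment": generate_assessment,
    "/grade-assessment": grade_assessment,
}

def handle(path: str, payload: dict) -> dict:
    """Dispatch a path to its handler; unknown paths get a 404-style error."""
    handler = ROUTES.get(path)
    if handler is None:
        return {"error": f"unknown endpoint: {path}"}
    return handler(payload)
```

Separating generation and grading per track keeps each endpoint single-purpose, which matches the deliverables' call for clear interfaces and an API-first design.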

Engagement Details

  • Start: ASAP
  • Duration: 4 days
  • Hours: 4–6 hrs/day
  • Budget: $15–$20/hr
  • Level: Intermediate

To Apply

Please include:

  • 2–3 relevant projects (LLM pipelines, multi-agent systems, or evaluation systems)
  • Your recommended approach or architecture (1–2 paragraphs)
  • How you would enforce "starter must fail all tests" reliably
  • How you would handle duplicate or near-duplicate detection
  • How you would implement grading for non-technical users
  • How you would structure API endpoints for generation and grading
  • Your availability and rate

Contract duration: 1 to 3 months, 30 hours per week.

Mandatory skills:

LLM Prompt Engineering, AI Agent Development, Generative AI, AI Model Integration
