
Build Reliable LLM Generation + Validation Pipeline

FreelanceJobs
CA · Posted March 7, 2026

Job Description

We have an early working V3 of an AI-powered coding challenge ("mission") generator.

The system generates coding missions, starter code, and tests using LLM agents, and validates them with Judge0 to ensure starter code fails all tests and missions are executable/self-contained.

Supabase/Postgres stores missions and user profiles.

We are looking for a freelancer with experience in LLM systems, AI pipelines, or backend architecture to either:

  • Improve and productionize the existing codebase, or
  • Build a new, robust generation pipeline from scratch
The system must support both technical and non-technical users:

  • Technical users: receive missions with starter code and tests in any programming language; evaluated via our self-hosted Judge0 instance
  • Non-technical users: receive multiple-choice or textual questions only; AI handles grading automatically

All functionality must be accessible via API endpoints, including mission/question generation and grading.

Key Project Deliverables

  • Generation Pipeline – Refactored or new pipeline with clear interfaces, modular design, and API-first endpoints
  • Validation Gates – Starter code must fail all tests reliably using static and runtime checks
  • Deduplication System – Prevent exact or semantic duplicates per user (works for code and questions)
  • Job Orchestration – Retry, revision, timeouts, and optional mission pool for async generation to reduce latency
  • Observability & Metrics – Structured logs, latency breakdown, token/cost tracking, failure diagnostics
  • AI Grading Layer – Automatically grade non-technical users' submissions
  • Support Any Language – For technical missions; system must integrate with Judge0 or equivalent sandboxed execution
  • Documentation & Runbooks – Short guides explaining how to run, monitor, and extend the system
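The "starter code must fail all tests" gate can be sketched as a pure function over a pluggable test runner. This is a sketch under assumptions: in production the runner would submit code and test to Judge0 and inspect the verdict, while injecting it here keeps the gate deterministic and unit-testable:

```python
from typing import Callable, List

# run_test(starter_code, test_source) -> True iff the test PASSES.
# A real implementation would create a Judge0 submission and check its status;
# the callable signature here is an assumption, not the project's actual API.
TestRunner = Callable[[str, str], bool]

def starter_fails_all_tests(starter_code: str, tests: List[str],
                            run_test: TestRunner) -> bool:
    """Accept a mission only if every test fails against the starter code."""
    if not tests:
        return False  # no tests means nothing was verified: reject
    return all(not run_test(starter_code, t) for t in tests)
```

Treating an empty test list as a failure closes an easy loophole: a mission with no tests would otherwise pass the gate vacuously.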

Endpoints (API Design) – Suggested endpoints include:

  • /generate-mission → technical mission generation
  • /grade-submission → code grading via Judge0
  • /generate-assessment → non-technical questions only
  • /grade-assessment → AI grading of answers
  • /mission-pool-status → optional async pool info
  • /metrics → observability data
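One way to sketch the endpoint surface is a plain dispatch table that a web framework layer would wrap. The handler names and payload keys below are hypothetical placeholders, not a committed API contract (`/mission-pool-status` and `/metrics` are omitted for brevity):

```python
# Placeholder handlers for the suggested endpoints.
def generate_mission(payload: dict) -> dict:
    return {"endpoint": "/generate-mission", "track": "technical", **payload}

def grade_submission(payload: dict) -> dict:
    return {"endpoint": "/grade-submission", "graded_by": "judge0", **payload}

def generate_assessment(payload: dict) -> dict:
    return {"endpoint": "/generate-assessment", "track": "non_technical", **payload}

def grade_assessment(payload: dict) -> dict:
    return {"endpoint": "/grade-assessment", "graded_by": "llm", **payload}

ROUTES = {
    "/generate-mission": generate_mission,
    "/grade-submission": grade_submission,
    "/generate-assessment": generate_assessment,
    "/grade-assessment": grade_assessment,
}

def handle(path: str, payload: dict) -> dict:
    """Dispatch a path to its handler; unknown paths get a 404-style error."""
    handler = ROUTES.get(path)
    if handler is None:
        return {"error": f"unknown endpoint: {path}"}
    return handler(payload)
```

Separating generation and grading per track keeps each endpoint single-purpose, which matches the deliverables' call for clear interfaces and an API-first design.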

Engagement Details

  • Start: ASAP
  • Duration: 4 days
  • Hours: 4–6 hrs/day
  • Budget: $15–$20/hr
  • Level: Intermediate

To Apply

Please include:

  • 2–3 relevant projects (LLM pipelines, multi-agent systems, or evaluation systems)
  • Your recommended approach or architecture (1–2 paragraphs)
  • How you would enforce "starter must fail all tests" reliably
  • How you would handle duplicate or near-duplicate detection
  • How you would implement grading for non-technical users
  • How you would structure API endpoints for generation and grading
  • Your availability and rate

Contract duration: 1 to 3 months, 30 hours per week.

Mandatory skills:

LLM Prompt Engineering, AI Agent Development, Generative AI, AI Model Integration
