Job Description
We have an early working V3 of an AI-powered coding challenge ("mission") generator.
The system generates coding missions, starter code, and tests using LLM agents, and validates them with Judge0 to ensure starter code fails all tests and missions are executable/self-contained.
Supabase/Postgres stores missions and user profiles.
We are looking for a freelancer with experience in LLM systems, AI pipelines, or backend architecture to either:
- Improve and productionize the existing codebase, or
- Build a new, robust generation pipeline from scratch
The system must support both technical and non-technical users:
- Technical users: receive missions with starter code and tests in any programming language; submissions are evaluated via our self-hosted Judge0 instance
- Non-technical users: receive multiple-choice or free-text questions only; the AI grades answers automatically
- All functionality must be accessible via API endpoints, including mission/question generation and grading.
Key Project Deliverables
- Generation Pipeline – Refactored or new pipeline with clear interfaces, modular design, and API-first endpoints
- Validation Gates – Starter code must reliably fail all tests, enforced through static and runtime checks (see the sketch after this list)
- Deduplication System – Prevent exact or semantic duplicates per user, for both code missions and questions (see the dedup sketch after this list)
- Job Orchestration – Retries, revisions, timeouts, and an optional pre-generated mission pool for async generation to reduce latency
- Observability & Metrics – Structured logs, latency breakdown, token/cost tracking, failure diagnostics
- AI Grading Layer – Automatically grade non-technical users' submissions
- Any-Language Support – For technical missions; the system must integrate with Judge0 or an equivalent sandboxed execution service
- Documentation & Runbooks – Short guides explaining how to run, monitor, and extend the system
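To make the validation-gate requirement concrete, here is a minimal sketch of the "starter must fail all tests" check against a self-hosted Judge0 instance. The mission shape (starter_code, language_id, tests with stdin/expected_output), the instance URL, and the function name are assumptions for illustration, not the project's actual schema:

```python
import requests

JUDGE0_URL = "http://localhost:2358"  # assumed self-hosted instance
ACCEPTED = 3  # Judge0 status id for "Accepted"

def starter_fails_all_tests(mission: dict) -> bool:
    """Accept the mission only if the starter code passes zero tests."""
    for test in mission["tests"]:
        resp = requests.post(
            f"{JUDGE0_URL}/submissions?base64_encoded=false&wait=true",
            json={
                "source_code": mission["starter_code"],
                "language_id": mission["language_id"],
                "stdin": test["stdin"],
                "expected_output": test["expected_output"],
            },
            timeout=30,
        )
        resp.raise_for_status()
        if resp.json()["status"]["id"] == ACCEPTED:
            return False  # starter passed a test: reject the mission
    return True
```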
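Likewise, one plausible (not prescribed) shape for per-user deduplication: an exact content hash plus embedding cosine similarity for near-duplicates. The embedding source, the 0.92 threshold, and the record shape are placeholders; with the existing Supabase/Postgres store, a pgvector column could turn the similarity scan into a single SQL query:

```python
import hashlib
import numpy as np

SIM_THRESHOLD = 0.92  # illustrative; tune against real duplicate pairs

def content_hash(text: str) -> str:
    """Normalize lightly, then hash for exact-duplicate detection."""
    return hashlib.sha256(text.strip().lower().encode()).hexdigest()

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_duplicate(candidate: str, candidate_vec: np.ndarray,
                 existing: list[dict]) -> bool:
    """existing: [{"hash": str, "vec": np.ndarray}, ...] for one user."""
    h = content_hash(candidate)
    for item in existing:
        if item["hash"] == h:
            return True  # exact duplicate
        if cosine(candidate_vec, item["vec"]) >= SIM_THRESHOLD:
            return True  # semantic near-duplicate
    return False
```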
Endpoints (API Design) – Suggested endpoints include (a minimal sketch follows this list):
- /generate-mission → technical mission generation
- /grade-submission → code grading via Judge0
- /generate-assessment → non-technical questions only
- /grade-assessment → AI grading of answers
- /mission-pool-status → optional async pool info
- /metrics → observability data
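A minimal FastAPI sketch of two of the suggested endpoints, only to show the expected request/response shape; the framework choice, the models, and the return payloads are assumptions, and all internals are stubbed:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class MissionRequest(BaseModel):
    user_id: str
    language: str              # any Judge0-supported language, e.g. "python"
    difficulty: str = "medium"

class AssessmentAnswer(BaseModel):
    user_id: str
    question_id: str
    answer: str                # free text or the chosen option

@app.post("/generate-mission")
def generate_mission(req: MissionRequest) -> dict:
    # Pipeline stub: LLM generation -> validation gates -> dedup -> persist
    return {"mission_id": "m-123", "starter_code": "...", "tests": []}

@app.post("/grade-assessment")
def grade_assessment(ans: AssessmentAnswer) -> dict:
    # AI grading stub: a rubric-driven LLM call returning score + feedback
    return {"score": 0.0, "feedback": "..."}
```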
Engagement Details
- Start: ASAP
- Duration: 4 days
- Hours: 4–6 hrs/day
- Budget: $15–$20/hr
- Level: Intermediate
To Apply
Please include:
- 2–3 relevant projects (LLM pipelines, multi-agent systems, or evaluation systems)
- Your recommended approach or architecture (1–2 paragraphs)
- How you would enforce "starter must fail all tests" reliably
- How you would handle duplicate or near-duplicate detection
- How you would implement grading for non-technical users
- How you would structure API endpoints for generation and grading
- Your availability and rate
Contract duration: 1 to 3 months, at 30 hours per week.
Mandatory skills:
LLM Prompt Engineering, AI Agent Development, Generative AI, AI Model Integration