Job Description
Research Scientist, LLM Evaluation & Post-Training
Company: Centific
Location: Palo Alto, CA or Seattle, WA (Hybrid/Remote)
Type: Full-time
Key Responsibilities
- Research Agenda & Experimentation: Define and execute a rigorous research agenda focused on LLM evaluation and post-training, with emphasis on evaluation-driven model improvement. Design experiments to study how evaluation methodologies impact fine-tuning and post-training outcomes.
- Evaluation Framework Development: Develop and validate comprehensive evaluation frameworks for LLM and multimodal systems, covering benchmark and task design, scoring methods, judge/model-assisted evaluation, human evaluation protocols, and robustness/stress testing.
- Advanced Evaluation Research: Lead research on frontier evaluation domains including long-context, cross-modal, and dynamic multi-turn evaluations. Study effectiveness and limitations of existing techniques and propose improved methodologies with clear validity and scalability tradeoffs.
- Model Behavior Analysis: Analyze model behavior and failure patterns; generate actionable recommendations for model improvement and evaluation redesign. Translate findings into practical improvements for customer solutions and Centific's internal platforms.
- Cross-Functional Collaboration: Partner with Language Data Scientists to integrate human-in-the-loop and synthetic data/evaluation strategies, and with AI/ML Research Engineers to translate research methods into scalable evaluation and post-training pipelines.
- Customer Engagement: Engage with customer technical stakeholders at leading AI organizations to understand evaluation goals, review methodologies, and provide expert scientific recommendations. Serve as a credible technical peer to research and engineering leaders.
- Knowledge & IP Creation: Contribute to internal benchmark datasets, reusable evaluation frameworks, and research assets. Produce high-quality technical documentation, internal research reports, and client-facing materials explaining methods, results, assumptions, and limitations.
- Thought Leadership: Contribute to Centific's position as a leader in LLM evaluation and post-training through publications, conference presentations, and open-source contributions.
Core Technical Competencies
You will provide technical depth and leadership across the following domains:
Evaluation Science & Benchmarking
- Expert-level benchmark dataset and test suite design for language and multimodal models
- Deep understanding of metric design, scoring reliability, and measurement validity
- Experience with human evaluation methods and quality assurance (rubric design, inter-rater reliability, adjudication frameworks)
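As a minimal sketch of the inter-rater reliability work named above (the raters and labels are hypothetical, and real pipelines typically use a library such as scikit-learn), Cohen's kappa measures agreement between two human evaluators beyond what chance alone would produce:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both raters labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b.get(k, 0) for k in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical pass/fail judgments from two annotators on six model outputs.
rater_1 = ["pass", "pass", "fail", "pass", "fail", "pass"]
rater_2 = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(round(cohens_kappa(rater_1, rater_2), 3))  # 5/6 observed agreement
```

Low kappa despite high raw agreement is a common signal that a rubric needs adjudication or redesign.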
LLM & Post-Training Methods
- Strong understanding of post-training techniques (SFT, RLHF, RLAIF, DPO, PPO, GRPO) and how training objectives interact with evaluation outcomes
- Ability to reason about model behavior, failure modes, and performance tradeoffs across tasks and domains
- Familiarity with alignment, safety, and robustness considerations in model evaluation
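To make the interaction between training objectives and evaluation concrete, here is a hedged per-example sketch of the DPO loss mentioned above (inputs are summed token log-probabilities; the numbers in the usage line are invented for illustration):

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Per-example Direct Preference Optimization loss.

    Arguments are the log-probs of the chosen and rejected responses
    under the policy being trained and a frozen reference model.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = (policy_chosen_lp - ref_chosen_lp) - (policy_rejected_lp - ref_rejected_lp)
    # Negative log-sigmoid of the scaled margin; shrinks as margin grows.
    return -math.log(1 / (1 + math.exp(-beta * margin)))

# Hypothetical log-probs: policy prefers the chosen response more than
# the reference does, so the loss is modest and falls as the gap widens.
print(round(dpo_loss(-10.0, -14.0, -11.0, -13.0), 3))
```

The key evaluation-relevant property is that the objective optimizes a preference margin, not task accuracy, so benchmark design must probe whether preference gains transfer to the behaviors being measured.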
Quantitative Analysis & Scientific Rigor
- Strong statistical analysis skills: sampling, uncertainty quantification, significance testing, error analysis, metric interpretation
- Ability to synthesize complex experimental findings into concise, actionable recommendations for engineering and business stakeholders
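As one hedged illustration of the uncertainty-quantification skills listed above (the scores are invented; production work would more likely use `scipy.stats.bootstrap`), a percentile bootstrap puts a confidence interval around a benchmark pass rate:

```python
import random

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a mean eval score."""
    rng = random.Random(seed)  # seeded for reproducible intervals
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choice(scores) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical per-example outcomes (1 = pass, 0 = fail) on a small benchmark.
scores = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
low, high = bootstrap_ci(scores)
print(f"mean={sum(scores) / len(scores):.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

Wide intervals on small eval sets are exactly why significance testing matters before declaring one model better than another.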
Required Qualifications
- Education: MS or PhD in Computer Science, Machine Learning, Statistics, Applied Mathematics, AI, or a related quantitative field (PhD strongly preferred).
- Research Experience: 5+ years of relevant experience in applied ML research or research science, with substantial work in LLMs or foundation models (graduate research counts).
- LLM Evaluation Expertise: Demonstrated experience with LLM evaluation, benchmarking, alignment, post-training, or model quality research.
- Experimental Design: Strong foundation in experimental design, statistical analysis, and scientific reasoning for ML systems.
- Technical Proficiency: Strong Python coding skills for research experimentation, data processing, evaluation pipelines, statistical analysis, and visualization. Hands-on experience with modern ML frameworks (PyTorch, Hugging Face, JAX/TensorFlow).
- Evaluation Methodology: Ability to evaluate and compare human and automated evaluation methods, including tradeoffs in cost, reliability, validity, and scalability.