AI/ML Research Scientist, LLM Post-Training & Evaluation

Centific
Full Time · Mid-level · Hybrid
Washington, District of Columbia, US · Posted April 16, 2026


Job Description

Research Scientist, LLM Evaluation & Post-Training

Company: Centific

Location: Palo Alto, CA or Seattle, WA (Hybrid/Remote)

Type: Full-time

Key Responsibilities

  • Research Agenda & Experimentation: Define and execute a rigorous research agenda focused on LLM evaluation and post-training, with emphasis on evaluation-driven model improvement. Design experiments to study how evaluation methodologies impact fine-tuning and post-training outcomes.
  • Evaluation Framework Development: Develop and validate comprehensive evaluation frameworks for LLM and multimodal systems, covering benchmark and task design, scoring methods, judge/model-assisted evaluation, human evaluation protocols, and robustness/stress testing.
  • Advanced Evaluation Research: Lead research on frontier evaluation domains including long-context, cross-modal, and dynamic multi-turn evaluations. Study effectiveness and limitations of existing techniques and propose improved methodologies with clear validity and scalability tradeoffs.
  • Model Behavior Analysis: Analyze model behavior and failure patterns; generate actionable recommendations for model improvement and evaluation redesign. Translate findings into practical improvements for customer solutions and Centific's internal platforms.
  • Cross-Functional Collaboration: Partner with Language Data Scientists to integrate human-in-the-loop and synthetic data/evaluation strategies, and with AI/ML Research Engineers to translate research methods into scalable evaluation and post-training pipelines.
  • Customer Engagement: Engage with customer technical stakeholders at leading AI organizations to understand evaluation goals, review methodologies, and provide expert scientific recommendations. Serve as a credible technical peer to research and engineering leaders.
  • Knowledge & IP Creation: Contribute to internal benchmark datasets, reusable evaluation frameworks, and research assets. Produce high-quality technical documentation, internal research reports, and client-facing materials explaining methods, results, assumptions, and limitations.
  • Thought Leadership: Contribute to Centific's position as a leader in LLM evaluation and post-training through publications, conference presentations, and open-source contributions.

Core Technical Competencies

You will provide technical depth and leadership across the following domains:

Evaluation Science & Benchmarking

  • Expert-level benchmark dataset and test suite design for language and multimodal models
  • Deep understanding of metric design, scoring reliability, and measurement validity
  • Experience with human evaluation methods and quality assurance (rubric design, inter-rater reliability, adjudication frameworks)
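The rubric-design and inter-rater reliability work listed above usually begins with simple agreement statistics. As a minimal, hypothetical sketch (the raters, labels, and scores below are invented for illustration), Cohen's kappa between two annotators applying the same rubric can be computed as follows:

# Hypothetical illustration: chance-corrected agreement between two human raters
# applying the same rubric to the same model responses. Data are invented.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning categorical labels to the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal label distribution.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    expected = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Two annotators rating the same eight responses on a three-point rubric.
rater_a = ["good", "good", "fair", "poor", "good", "fair", "poor", "good"]
rater_b = ["good", "fair", "fair", "poor", "good", "good", "poor", "good"]
print(f"Cohen's kappa: {cohens_kappa(rater_a, rater_b):.2f}")  # ~0.60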

LLM & Post-Training Methods

  • Strong understanding of post-training techniques (SFT, RLHF, RLAIF, DPO, PPO, GRPO) and how training objectives interact with evaluation outcomes
  • Ability to reason about model behavior, failure modes, and performance tradeoffs across tasks and domains
  • Familiarity with alignment, safety, and robustness considerations in model evaluation
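Of the post-training techniques named above, DPO has the most compact objective, which makes it a useful reference point for reasoning about how training objectives interact with evaluation outcomes. A minimal PyTorch sketch, assuming summed per-response log-probabilities have already been computed under the policy and a frozen reference model (tensor values and the beta setting are illustrative, not prescribed by this role):

# Hypothetical sketch of the DPO objective: increase the policy's preference
# margin (chosen over rejected) relative to a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss for a batch of preference pairs."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    # -log sigmoid(beta * (policy margin - reference margin)), averaged over the batch.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy batch of three preference pairs; in practice the log-probs come from the models.
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-12.0, -15.5, -9.8]),
    policy_rejected_logps=torch.tensor([-13.1, -14.9, -11.2]),
    ref_chosen_logps=torch.tensor([-12.4, -15.0, -10.1]),
    ref_rejected_logps=torch.tensor([-12.9, -15.2, -10.8]),
)
print(loss.item())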

Quantitative Analysis & Scientific Rigor

  • Strong statistical analysis skills: sampling, uncertainty quantification, significance testing, error analysis, metric interpretation
  • Ability to synthesize complex experimental findings into concise, actionable recommendations for engineering and business stakeholders
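For the uncertainty quantification mentioned above, a nonparametric bootstrap over per-item results is a common baseline for putting a confidence interval on a benchmark score. A minimal sketch with invented data (the evaluation run, accuracy level, and item count are hypothetical):

# Hypothetical illustration: percentile bootstrap CI for a mean benchmark score.
import numpy as np

def bootstrap_ci(per_item_scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-item scores."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_item_scores, dtype=float)
    resample_means = np.array([
        rng.choice(scores, size=len(scores), replace=True).mean()
        for _ in range(n_resamples)
    ])
    lo, hi = np.quantile(resample_means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (lo, hi)

# Example: 0/1 correctness over 200 benchmark items from a simulated eval run.
rng = np.random.default_rng(42)
per_item = rng.binomial(1, 0.72, size=200)
mean, (lo, hi) = bootstrap_ci(per_item)
print(f"accuracy = {mean:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")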

Required Qualifications

  • Education: MS or PhD in Computer Science, Machine Learning, Statistics, Applied Mathematics, AI, or a related quantitative field (PhD strongly preferred).
  • Research Experience: 5+ years of relevant experience in applied ML research or research science, with substantial work in LLMs or foundation models (graduate research counts).
  • LLM Evaluation Expertise: Demonstrated experience with LLM evaluation, benchmarking, alignment, post-training, or model quality research.
  • Experimental Design: Strong foundation in experimental design, statistical analysis, and scientific reasoning for ML systems.
  • Technical Proficiency: Strong Python coding skills for research experimentation, data processing, evaluation pipelines, statistical analysis, and visualization. Hands-on experience with modern ML frameworks (PyTorch, Hugging Face, JAX/TensorFlow).
  • Evaluation Methodology: Ability to evaluate and compare human and automated evaluation methods, including tradeoffs in cost, reliability, validity, and scalability.
