Job Description
Reddit is continuing to grow our teams with the best talent. This role is fully remote-friendly within the United States. If you live close to one of our physical office locations (San Francisco, Los Angeles, New York City, and Chicago), our doors are open for you to come into the office as often as you'd like.
The AI Engineering team at Reddit is embarking on a strategic initiative to build our own Reddit-native foundational Large Language Models (LLMs). This team sits at the intersection of applied research and massive-scale infrastructure, tasked with training models that truly understand the unique culture, language, and structure of Reddit communities. You will be joining a team of distinguished engineers and safety experts to build the "engine room" of Reddit's AI future—creating the foundational models that will power Safety & Moderation, Search, Ads, and the next generation of user products.
As a Staff Research Engineer for Pre-training Data, you will define the technical strategy and architecture for the data curriculum pipelines that power our next-generation foundation models. Sitting at the intersection of distributed infrastructure, multimodal processing, and mathematics, you will design systems that transform Reddit’s unique corpus of human conversation—petabytes of text, images, and video—into high-quality training signals. You will move beyond flat text processing to engineer solutions that respect the complex, tree-structured nature of Reddit threads, ensuring our models learn the nuance of community interaction.
Responsibilities:
- Architect and implement high-throughput, deterministic data sampling systems capable of feeding distributed training clusters at frontier-model scale.
- Design and execute dynamic curriculum learning strategies, creating systems that automatically adjust data distributions (text vs. multimodal) during training to improve model stability and reasoning capabilities.
- Engineer the logic for serializing Reddit’s complex conversational trees (threads, subreddits, cross-posts) into optimal training contexts, developing topological data processing strategies that preserve semantic relationships for model understanding.
- Formulate and validate statistical hypotheses regarding data mixtures, leveraging advanced sampling theory to minimize bias and maximize token quality.
- Design the "Safety-First" ingestion layer: Build automated pipelines for PII redaction, toxicity signals, and quality deduplication upstream of training, working closely with our Safety and Moderation Engineering counterparts.
- Bridge the gap between research and engineering by translating theoretical sampling insights into robust, low-latency production infrastructure.
- Mentor senior engineers and researchers on system design, numerical correctness, and performance optimization within distributed Python/Rust environments.
Required Qualifications:
- 8+ years of software engineering experience with a focus on machine learning infrastructure, data science at scale, or LLM pre-training.
- Expert proficiency in Python and distributed data processing frameworks (e.g., Ray Data, Spark, or custom high-performance dataloaders).