Software Engineer (Machine Learning Infrastructure, Slack)

Salesforce

Full Time | Mid

Washington, District of Columbia, US | $149k – $314k | Posted 6 days ago

Resume Keywords to Include

Make sure these keywords appear in your resume to improve ATS scoring

AWS, GCP, Azure, Kubernetes, Spark, Airflow

Job Description

The AI and ML Infrastructure team is part of Slack’s Core Infrastructure organization and is responsible for the foundational systems that enable machine learning and AI across the company. The team designs, builds, and operates reliable, scalable, and high-performance platforms that allow product and ML teams to develop, deploy, and operate AI-driven capabilities with confidence.
The team owns shared infrastructure, services, and tooling that support the full ML lifecycle, including model training, deployment, inference, and monitoring. As Slack AI continues to grow, the team is evolving from traditional ML deployments toward large-scale, highly distributed systems.
This work involves deep architectural decisions around scalable model deployment strategies, real-time feature serving at very high throughput, GPU-accelerated inference at message scale, and responsible training of models on sensitive data with strong privacy and safety requirements.
The ML Infrastructure focus area is responsible for the low-level systems that power training and inference at scale. This includes architecting and maintaining distributed systems for model training, serving, and deployment using Kubernetes-based platforms, GPU infrastructure, and open-source ML stacks such as KubeRay and vLLM.
The team delivers platform capabilities that improve the speed, reliability, and quality of ML development, including training pipelines, feature generation systems, and compute orchestration.
We are looking for Software Engineers to join the ML Infrastructure focus area and help architect and operate the core systems that power AI at Slack. In this role, you will own foundational infrastructure for large-scale model training and inference, and evolve it into a reliable, secure, and self-service platform used across the company.
You will work at the intersection of distributed systems, GPU infrastructure, and modern ML stacks, solving complex scalability and reliability challenges. This role blends deep systems engineering with a strong understanding of the ML lifecycle, and plays a critical part in shaping the long-term technical foundations of Slack’s AI capabilities.
Design, build, and operate systems to train, serve, and deploy machine learning models at scale, with a focus on reliability, performance, and operational simplicity
Evolve GPU-backed inference infrastructure to support high-throughput, latency-sensitive workloads, including large-scale model serving
Architect and optimize distributed training and data processing systems using platforms such as Ray, Airflow, Spark, or similar technologies
Build and maintain Kubernetes-based platforms and orchestration layers using tools such as KubeRay, vLLM, and internally developed services
Architect solutions that bridge legacy systems with modern technologies while maintaining monolithic application stability
Develop robust monitoring, observability, and alerting for production ML workloads to ensure operational excellence
Partner closely with AI Platform, ML modeling, security, and product engineering teams to design infrastructure that supports evolving AI use cases
Provide technical leadership through design reviews and mentorship, and by setting engineering standards and long-term architectural direction for ML infrastructure
Author technical design and architecture documentation, and contribute thought leadership through engineering blog posts

Benefits

Medical Care
Life Insurance
Retirement Savings
Employee Assistance Programs
With nine standard holidays and four floating holidays, you get a total of 13 paid days off each year

Qualifications

Experience building and operating cloud-native systems on public cloud platforms such as AWS, GCP, or Azure, including infrastructure as code
Experience working with GPU infrastructure, including performance optimization and operational management at scale
Significant professional experience in software engineering with a strong focus on infrastructure, backend systems, platform engineering, or MLOps
A demonstrated ability to drive technical direction for complex systems and balance short-term delivery with long-term architectural goals
Deep experience building and operating distributed systems, including expert-level knowledge of Kubernetes and container-based platforms
Hands-on experience with modern ML infrastructure and serving stacks such as Ray or KubeRay, vLLM, or similar training and inference orchestration frameworks
A related technical degree is required
Excellent written communication, as well as ability to thrive in an asynchronous and globall