
Senior Machine Learning Engineer – Applied AI Research

Manulife
Full Time · Senior
Waterloo, Ontario, CA · Posted March 3, 2026


Job Description

As a Senior AI & Machine Learning Engineer in our Applied AI Research team, you will drive the technical direction for next-generation AI systems while owning the cloud infrastructure that powers them. This dual role spans the full stack—from architecting cost-efficient, high-performance cloud environments for AI workloads to designing and implementing Large Language Models (LLMs), agentic frameworks, and distilled Small Language Models (SLMs). You will lead the design of scalable infrastructure and intelligent systems alike, mentor the team, publish research findings, contribute to open source, and translate breakthrough AI capabilities into innovative production solutions.

Position Responsibilities:

Cloud Infrastructure & Platform Engineering

  • AI Infrastructure Architecture: Design, build, and maintain cloud infrastructure (Azure required; AWS a plus) purpose-built for AI/ML workloads, including GPU clusters, training pipelines, and model serving platforms.
  • Cost Optimization: Continuously analyze and optimize cloud spend for AI workloads—implement cluster/instance strategies, right-size GPU allocations, manage reserved capacity, and establish FinOps practices to maximize performance per dollar.
  • Performance Engineering: Tune infrastructure for throughput and latency across training and inference workloads, including networking, storage I/O, and GPU utilization monitoring.
  • Platform Reliability: Ensure reliable onboarding and operation of AI services through infrastructure-as-code (Terraform or Pulumi; Terraform a plus), automated scaling, and robust monitoring/alerting pipelines.
  • MLOps & CI/CD: Build and maintain end-to-end MLOps pipelines for model training, evaluation, registry, and deployment, enabling rapid and reproducible experimentation-to-production workflows.
  • Security & Governance: Implement cloud security best practices for AI workloads, including data encryption, access controls, network isolation, and compliance with organizational policies for model and data governance.

AI/ML Engineering

  • Applied AI Leadership: Set the technical direction for the adoption of emerging AI capabilities, specifically focusing on LLMs, autonomous agents, and multi-modal systems.
  • Model Engineering: Lead the end-to-end lifecycle of high-performance models, including fine-tuning, distillation of large models into efficient Small Language Models (SLMs), and quantization for deployment.
  • Agentic Systems: Architect and build complex, reasoning-based AI agents using modern frameworks to solve open-ended business challenges.
  • Innovation & Research: Experiment with state-of-the-art (SOTA) techniques, publish technical papers, and contribute to the open-source community to elevate the organization's technical brand.
  • Mentorship: Mentor junior engineers and researchers on best practices in prompt engineering, model evaluation, distributed training, and cloud-native AI development.
  • Productionization: Bridge the gap between research and production by converting experimental prototypes into scalable, reliable AI services running on optimized cloud infrastructure.
  • Strategic Collaboration: Work with stakeholders to identify high-impact opportunities for disruptive AI, translating technical possibilities into strategic business outcomes.

Required Qualifications:

  • Significant hands-on experience with at least one major cloud platform (primarily Azure; AWS or GCP a plus), including compute, networking, storage, and IAM. Experience managing a Databricks environment is a plus.
  • Demonstrated experience managing GPU-accelerated cloud environments and optimizing their cost and performance.
  • Expert knowledge of modern AI frameworks and libraries (e.g., PyTorch, TensorFlow, Hugging Face Transformers, LangChain, LlamaIndex).
  • Proven experience in fine-tuning LLMs (e.g., LoRA, PEFT) and utilizing RAG (Retrieval-Augmented Generation) architectures.
  • Strong proficiency with infrastructure-as-code tools (Terraform, CloudFormation, or Pulumi) and container orchestration (Docker, Kubernetes).
  • Ability to drive a strategic vision regarding AI infrastructure, GPU optimization, cost management, and model evaluation pipelines.
  • Minimum Bachelor's degree in Computer Science, Math, or Engineering; Master's or PhD preferred for this research-focused role.

Preferred Qualifications

  • Deep technical understanding of Transformer architectures, attention mechanisms, and model distillation techniques.
  • Experience building agentic workflows and using vector databases.
  • Experience with Kubernetes-based ML platforms (Kubeflow, Ray, KServe/Triton Inference Server) for training and serving at scale.
  • Familiarity with FinOps tooling and practices for cloud cost governance across AI workloads.
  • Advanced knowledge in distributed computing and training large models across multi-GPU/node clusters.
  • Track record of publishing papers in top-tier conferences (NeurIPS, ICML, ICLR) or significant contributions to open-source AI projects.
  • Experience with observability tooling and practices.
