About the Job

At XP Venture Labs, we partner with ambitious companies to solve complex technology challenges and accelerate growth. Our teams are composed of highly skilled engineers, architects, and technology leaders who bring deep technical expertise and real-world delivery experience. We don’t operate as traditional consultants. We embed as strategic partners to aid in designing scalable systems, modernizing platforms, improving reliability, and helping our clients navigate high-impact technical decisions with confidence.

From cloud architecture and distributed systems to platform engineering and large-scale modernization initiatives, we specialize in solving the kinds of problems that demand precision, experience, and a relentless focus on outcomes.

About the Role

The Senior Site Reliability Engineer (SRE) Team Lead plays a critical role in ensuring the reliability, scalability, performance, and security of our production systems. This position blends deep technical expertise with hands-on leadership, operational excellence, and close cross-functional collaboration across Engineering and DevOps. The ideal candidate fosters a strong culture of reliability, designs resilient and scalable infrastructure, and mentors a high-performing SRE team that proactively prevents issues rather than simply reacting to them.

This is a high-impact leadership opportunity at a pivotal stage in our growth. As we scale our platform to support major company-wide initiatives, you will help shape the future of our reliability strategy. You’ll have the autonomy to evaluate and introduce new tools and technologies, establish best practices, and define the standards and operational procedures that will guide the SRE function moving forward.

Key Responsibilities

As a Senior Site Reliability Engineer (SRE) Team Lead, you’ll lead critical initiatives that shape our platform’s architecture, performance, and scalability. Here’s what you’ll be responsible for:

Own the reliability, availability, and performance of production systems.
Define and manage SLAs, SLOs, and SLIs as well as error budgets to drive measurable reliability improvements.
Build and evolve monitoring, logging, and observability standards and metrics.
Lead incident response, postmortems, and root cause analysis to reduce recurrence and mean time to recovery (MTTR).
Architect and maintain scalable, highly available cloud infrastructure.
Champion Infrastructure-as-Code, automation, and CI/CD best practices.
Establish proactive capacity planning and performance optimization strategies.
Mentor and develop a high-performing SRE team; set on-call and operational excellence standards.
Partner with Engineering, DevOps, Security, and Product teams to embed reliability into the software development life cycle (SDLC).
Evaluate and implement new tools, technologies, and operational frameworks to improve platform resilience and efficiency.

Technical Requirements

For this role, you must have the following technical skills:

Deep expertise in Amazon Web Services (AWS), including core services such as EC2, ECS/EKS, Lambda, RDS, DynamoDB, S3, IAM, VPC, and networking.
Strong experience designing and managing containerized applications using Docker and Kubernetes.
Windows Server and IIS administration experience, including PowerShell for Windows and legacy automation.
Experience with MS SQL Server performance tuning and optimization.
Previous experience with performance monitoring in a .NET environment that includes Angular applications and C#-based applications and backend services.
Advanced Infrastructure-as-Code (IaC) experience with Terraform, AWS CloudFormation, and AWS Serverless Application Model (SAM).
Proven ability to architect secure, scalable, and highly available cloud environments in AWS.
Experience deploying and operating serverless architectures using AWS Lambda and event-driven patterns.
Strong hands-on experience with observability and monitoring tools. Extra points for experience specifically with Grafana, BetterStack, and Kibana.
Ability to design performance dashboards, structured logging pipelines, and actionable alerting systems.
Experience operating and tuning message brokers such as RabbitMQ.
Strong understanding of distributed systems, microservice API architecture, networking, and high-availability design principles.
Proficiency in scripting and automation (e.g., Python, Bash, Shell) to support CI/CD and infrastructure automation workflows.
Excellent written and verbal communication skills in English.
Must be physically located in Canada and legally authorized to work in Canada.
Bonus: networking experience such as DNS management, VPN configuration, and packet analysis; DevOps experience such as CI/CD pipeline creation and maintenance, specifically in Azure DevOps.

What We Offer

Competitive compensation: $100,000 - $140,000 CAD annually, based on experience and expertise.
Meaningful growth opportunities: Work on diverse, high-impact projects that expand your technical depth and leadership capabilities.
Collaborative, high-performance culture: Join a team of skilled professionals who value innovation, technical excellence, and shared success.
Fully remote flexibility: 100% work-from-home (WFH, Remote) environment, enabling improved work-life balance.
Modern technology stack: Access to cutting-edge tools and technologies, particularly in data visualization and cloud infrastructure.
Human-centered hiring process: No AI tools are used at any stage of the application or evaluation process.

Application Instructions

Interested candidates are encouraged to submit their resume at their earliest convenience.

Pay: $100,000.00-$140,000.00 per year

Benefits

Flexible schedule

Work Location: Remote
