About Us

Space Ops is a fast-growing Canadian fintech redefining how businesses and individuals manage financial transactions. Our secure, scalable payment platforms process mission-critical workloads where reliability isn’t optional; it’s foundational.

We are building a reliability-first engineering culture and are looking for a Site Reliability Engineer (SRE) who thrives in high-availability, cloud-native environments.

This is an engineering role, not a helpdesk or general IT position.

The Role

As a Site Reliability Engineer, you will ensure the reliability, scalability, performance, and security of our AWS-based systems. You will work closely with software engineers and platform teams to implement SLIs/SLOs, manage error budgets, operate EKS clusters, and build automated, self-healing infrastructure.

You will play a key role in incident leadership, observability maturity, and reducing operational toil through automation.

What You’ll Do

Reliability & SLO Ownership

Define and track Service Level Indicators (SLIs) and Service Level Objectives (SLOs)
Manage error budgets to balance feature velocity and reliability
Reduce MTTD, MTTA, and MTTR
Lead blameless postmortems and implement preventative improvements

Cloud & Kubernetes Operations

Operate and optimize AWS EKS clusters
Design multi-AZ, fault-tolerant architectures
Implement and tune auto-scaling strategies
Improve system resilience and capacity planning

Observability & Monitoring

Build and maintain metrics, logging, tracing, and alerting systems
Implement actionable, symptom-based alerts (low-noise, high-signal)
Maintain full observability coverage across critical paths
Implement synthetic monitoring (URL, API, and user journey checks)

Automation & Infrastructure as Code

Eliminate operational toil through automation
Build and maintain infrastructure using Terraform, CDK, or CloudFormation
Improve CI/CD pipelines with safe deployment practices (canary releases, feature flags)
Develop scripts and tooling in Python, Go, or Bash

Collaboration & Engineering Excellence

Partner with developers to design reliable, operable systems
Participate in architectural reviews with a reliability-first mindset
Support secure, scalable system design
Document runbooks, playbooks, and operational standards

What You Bring

Required Experience

5+ years in SRE, DevOps, or cloud infrastructure roles
Production experience with AWS (EC2, EKS/ECS, Lambda, ALB/NLB, VPC, S3, RDS, IAM, CloudTrail)
Strong Kubernetes operational experience (EKS preferred)
Proven experience with SLOs, SLIs, and error budgets
Hands-on Infrastructure as Code experience (Terraform, CDK, or CloudFormation)
Experience with observability tools (CloudWatch, Prometheus, Grafana, X-Ray, ADOT, New Relic, etc.)
Production on-call and incident management experience

Technical Skills

Strong Linux and networking fundamentals
Proficiency in Python, Go, Java, or Bash
Experience with MongoDB, PostgreSQL, Redis, or RabbitMQ
CI/CD experience (GitHub, Jenkins, CodePipeline, etc.)
Familiarity with security best practices (IAM, KMS, Secrets Manager, GuardDuty)

Nice to Have

Performance engineering or chaos testing experience
Experience in fintech or regulated environments
Experience working with distributed storage systems

Compensation & Benefits

Compensation: $80,000 – $90,000 annually
Annual performance bonus
Comprehensive health, dental & optical coverage
Employee Assistance Program (EAP)
Health Spending Account (HSA)
Monthly company wellness & engagement events
Perkopolis staff discounts
Training & development opportunities
Onsite gym access
Gym concessions with GoodLife & Planet Fitness

Job Type: Full-time

Pay: $80,000.00-$90,000.00 per year

Benefits

Company events
Dental care
Discounted or free food
Employee assistance program
Life insurance
Paid time off
Vision care
Wellness program

Ability to commute/relocate:

Hamilton, ON: reliably commute or plan to relocate before starting work (required)

Work Location: In person

