Senior Site Reliability Engineer

Devopie Inc.
Full Time, Senior
Hamilton, Ontario, CA
Posted March 18, 2026

Job Description

💡 What You’ll Do

You’ll operate at the intersection of software engineering and systems engineering, building resilient systems that scale, self-heal, and empower developers to ship safely.

🔎 Reliability Engineering

  • Define and manage SLIs, SLOs, and error budgets
  • Reduce MTTD, MTTA, and MTTR through structured incident response
  • Conduct blameless postmortems and drive preventative improvements
  • Champion reliability in architectural reviews and production readiness

📊 Observability & Monitoring

  • Design actionable, symptom-based alerts (not noise)
  • Build dashboards and tracing systems using tools like CloudWatch, Prometheus, Grafana, New Relic, X-Ray, ADOT
  • Implement synthetic monitoring to simulate real user journeys (URLs, clickpaths, APIs)
  • Ensure full observability coverage across critical paths

☁️ Cloud & Infrastructure

  • Operate and optimize AWS environments (EC2, EKS/ECS, Lambda, VPC, RDS, IAM, S3, ALB/NLB, CloudTrail)
  • Build resilient multi-AZ and regionally replicated systems
  • Implement autoscaling and fault-tolerant architecture
  • Leverage Infrastructure as Code (Terraform, CDK, CloudFormation)

🤖 Automation & Toil Reduction

  • Eliminate manual processes through automation
  • Build self-healing infrastructure
  • Improve CI/CD pipelines with safe deployment strategies (canary releases, feature flags)
  • Write production-quality code (not just scripts) in Python, Go, Ruby, Bash, or Java

📈 Performance & Capacity Planning

  • Analyze system metrics and traffic patterns
  • Conduct load testing, chaos testing, and capacity modeling
  • Identify bottlenecks and proactively optimize systems

🤝 Cross-Functional Collaboration

You’ll work closely with:

  • Engineering & Platform teams on scalable system design
  • Security teams on IAM, KMS, GuardDuty, secrets management
  • Product leaders to align reliability with roadmap priorities
  • Cloud vendors and SaaS providers during critical incidents

🧠 What You Bring

Must-Have Experience

  • Bachelor’s degree in Computer Science, Software Engineering, or related field
  • Strong Linux/Unix systems knowledge
  • Deep AWS experience
  • Hands-on experience with Docker and container orchestration (Kubernetes on EKS, or ECS)
  • Infrastructure as Code (Terraform, CDK, CloudFormation)
  • Production on-call and incident management experience
  • Strong understanding of MTTx metrics (MTTD, MTTR, MTBF, etc.)
  • Experience with MongoDB, PostgreSQL, Redis, RabbitMQ
  • Experience with observability and monitoring platforms
  • CI/CD pipeline experience (GitHub, Kubernetes, etc.)

Nice-to-Have

  • Performance engineering and chaos testing
  • Experience in fintech or regulated environments
  • Knowledge of distributed storage systems (NFS, HDFS, Ceph, S3)
  • Familiarity with dynamic resource frameworks (Kubernetes, Mesos, YARN)
