Site Reliability Engineer

Full Time, Mid-level

Halifax, Nova Scotia, CA

Posted 9 weeks ago

Role Overview

Humankind Global Recruitment is hiring a mid-level Site Reliability Engineer. This is a full-time role in Halifax, Nova Scotia, and is part of Humankind Global Recruitment's DevOps hiring. Full responsibilities, required qualifications, and the apply link are listed in the description below.

Resume Keywords to Include

Make sure these keywords appear in your resume to improve ATS scoring

AWS, Azure, Docker, Kubernetes, Terraform, Ansible, Apache, Redis

Job Description

Our client, a dynamic Information Technology services company that partners with leading global organizations to deliver innovative, high-quality IT solutions, is looking for a Site Reliability Engineer.

As a Site Reliability Engineer (SRE), you will be responsible for ensuring the reliability, availability, and performance of our mission-critical applications and systems. You will be part of a team that bridges the gap between development and operations, leveraging your deep technical expertise and problem-solving skills to implement best practices in infrastructure automation, monitoring, scaling, and incident response.

As a member of the team, you will mentor and guide more junior SREs, work with cross-functional teams, and drive improvements to systems and processes. A passion for building highly resilient, scalable, and efficient systems is key to success in this role.

The SRE will play a key role in helping Team Leads and Senior SREs bridge the gap between the organization as a customer and the team as a service provider. The SRE is expected to lead, mentor, and inspire people while bringing superb technical knowledge to troubleshoot and improve systems.

Responsibilities

Reliability & Availability:

Maintain and improve system reliability, uptime, and performance across production environments.
Set and track service-level objectives (SLOs), service-level indicators (SLIs), and service-level agreements (SLAs).
Drive improvements in incident response processes, ensuring systems are fault-tolerant and highly available.

Automation & Infrastructure:

Design, implement, and maintain automation tools to deploy and manage infrastructure at scale.
Collaborate with software engineering teams to integrate reliability practices into CI/CD pipelines.
Improve the scalability, resilience, and efficiency of cloud infrastructure.

Monitoring & Observability:

Implement and maintain monitoring systems and alerts to ensure proactive identification of issues.
Define key performance metrics and implement logging, monitoring, and alerting solutions across all services and platforms.

Incident Management & Root Cause Analysis:

Lead and participate in incident response efforts, performing root cause analysis and post-mortems to prevent recurrence.
Champion a culture of blameless post-mortems and continuously improve incident response playbooks.
Provide technical leadership and mentorship to junior SREs and other team members.
Work closely with engineering teams to ensure reliability is a key consideration in application design and development.
Foster a culture of collaboration between development, operations, and SRE teams to ensure continuous improvement in service reliability.
Advocate for and implement changes to improve performance, reduce toil, and optimize resource utilization.
Drive the evolution of operational tooling and processes to enhance the quality of service provided to customers.

Qualifications

7+ years of experience working as a Site Reliability Engineer is required.

Infrastructure Automation & Configuration Management:

Proficiency in infrastructure-as-code (IaC) and automation tools such as Terraform, Ansible, and AWX.
Knowledge of the KVM hypervisor.
Experience with containerization technologies like Docker and Kubernetes.
Knowledge of cloud platforms such as AWS (especially S3), Google Cloud, or Azure is a plus.

Monitoring & Observability:

Hands-on experience with monitoring tools such as Zabbix, Prometheus, Grafana, etc.
Strong understanding of logging and tracing technologies (e.g., ELK stack, Fluentd, OpenTelemetry).
Experience with Redis and RabbitMQ.

Distributed Systems & Networking:

Solid understanding of distributed system design principles (CAP theorem, eventual consistency, etc.).
Familiarity with network protocols and debugging tools (e.g., TCP/IP, HTTP/HTTPS, DNS, SMTP, load balancing).
Solid knowledge of HTTP/HTTPS-related services such as Apache web server, HAProxy, and HTTP/HTTPS proxies, and of mail services (Postfix, StrongMail).
Familiarity with distributed concepts such as GeoDNS and CLB (Cloud Load Balancing).
Deep knowledge of operational security (service hardening, WAF, honeypots, SIEMs).
Mastery of IPv4 fundamentals (including basic knowledge of routing protocols such as BGP and OSPF). IPv6 experience is a plus.

Incident Management & Root Cause Analysis:

Proven ability to lead post-incident reviews and write detailed post-mortems.
Experience with incident management tools.

CI/CD & DevOps Practices:

Experience with CI/

Frequently Asked Questions

How do I apply for the Site Reliability Engineer position at Humankind Global Recruitment?

Use the Apply button above to submit your application directly to Humankind Global Recruitment. Most applications take less than 5 minutes if your resume and contact details are ready, and you'll be routed to the employer's official application system to finish.

Where is the Site Reliability Engineer position at Humankind Global Recruitment located?

This position is based in Halifax, Nova Scotia. Humankind Global Recruitment has not indicated remote or hybrid options for this role, so candidates should plan for on-site work.

What does a Site Reliability Engineer at Humankind Global Recruitment earn?

Humankind Global Recruitment has not disclosed a salary range in this posting. Many employers share specifics later in the interview process; you can also ask during a recruiter screen if compensation transparency is important to you.

When was the Site Reliability Engineer role at Humankind Global Recruitment posted?

This role was posted on April 6, 2026 (63 days ago). It's still listed as actively hiring; we re-confirm openings against the source system multiple times per day and remove closed roles.
