Job Description

IREN is a leading AI Cloud Service Provider, delivering large-scale GPU clusters for AI training and inference. IREN’s vertically integrated platform is underpinned by its expansive portfolio of grid-connected land and data centers in renewable-rich regions across the U.S. and Canada.

With 100% renewable energy, we build, own and operate our data centers and take pride in being at the forefront of sustainable solutions for the ever-evolving applications of high-performance compute. We believe that human progress is invaluable, but it should be achieved in the right way – responsibly, sustainably, and with a positive impact on the communities we operate in.

The Platform Infrastructure Division builds and operates the foundational systems that power IREN’s GPU-enabled, multi-tenant compute platform. We are hiring a Senior Service Reliability Engineer to lead observability architecture and reliability engineering across large-scale distributed systems.

This role owns the design, scalability, and operational excellence of the observability platform. You will transform high-volume metrics, logs, events, and traces into actionable intelligence that improves reliability, performance, and operational efficiency.

You will be on the bleeding edge of the AI Revolution, building monitoring systems for thousands of GPUs and integrating reliability principles into the heart of our operations.

Job Requirements

Technical Skills

7+ years in Site Reliability Engineering, DevOps, Infrastructure Engineering, or similar roles.
3+ years owning observability platforms at meaningful scale.
Deep understanding of distributed systems and production operations.
Strong hands-on experience with Prometheus and Grafana in large-scale environments.
Experience with tracing and logging ecosystems including OpenTelemetry, Jaeger, Tempo, Loki, or Elasticsearch.
Strong Linux systems engineering background including performance analysis and troubleshooting.
Experience operating Kubernetes in production environments.
Strong networking fundamentals including TCP/IP, DNS, and service-to-service communication patterns.
Proficiency in Go, Python, or similar modern programming languages.
Experience building automation and internal reliability tooling.
Experience managing high-volume telemetry ingestion and time-series storage systems.

Soft Skills & Competencies

Strong analytical and troubleshooting capabilities across complex distributed systems.
Ownership mindset with a strong sense of responsibility toward production systems.
Effective communicator able to collaborate across engineering, infrastructure, and leadership teams.
Pragmatic problem solver focused on reliability, scalability, and operational excellence.

Nice-to-Have

Experience operating GPU-dense environments or high-performance compute clusters.
Experience integrating GPU telemetry and hardware health signals into observability systems.
Familiarity with InfiniBand, RoCE, or advanced data center networking fabrics.
Experience integrating out-of-band management telemetry such as Redfish or BMC event streams.
Experience supporting AI training infrastructure or research compute environments.

Job Responsibilities

Observability Architecture & Telemetry Strategy

Design and own end-to-end observability architecture across metrics, logs, traces, and event streams.
Define telemetry standards and enforce consistent metadata across services and infrastructure domains.
Establish and operationalize service level indicators and service level objectives across critical systems.
Implement error budget-driven alerting strategies that prioritize signal over noise.
Architect highly available, scalable, and cost-efficient telemetry ingestion and storage systems.
Develop executive and engineering-level dashboards that surface reliability posture and system health trends.

Incident Management & Operational Excellence

Own and evolve the full incident lifecycle across detection, triage, mitigation, resolution, and recovery.
Design severity models, escalation paths, and response playbooks across software and infrastructure domains.
Lead complex cross-functional incident response efforts involving distributed systems and GPU infrastructure.
Conduct structured, blameless post-incident reviews and drive long-term systemic improvements.
Track and improve key operational metrics including mean time to detect, mean time to recover, and change failure rate.
Partner with engineering teams to eliminate recurring incidents through automation and architectural improvements.

Software & Infrastructure Observability

Partner with engineering teams to standardize instrumentation across applications and services.
Drive adoption of distributed tracing and structured logging best practices.
Build investigative workflows that connect application-level symptoms to infrastructure and hardware signals.
Correlate GPU health events and hardware telemetry with application performance and reliability metrics.
Create topology-aware views of large-scale systems to accelerate incident diagnosis and root cause analysis.

Observability Platform Engineering

Design and operate Prometheus at scale including federation, recording rules, and alert optimization.
Build and maintain Grafana dashboards, alerting strategies, and role-based access models.
Operate log aggregation and indexing platforms such as Loki or Elasticsearch.
Implement distributed tracing systems using OpenTelemetry and compatible backends.
Manage telemetry ingestion pipelines, retention strategies, and storage tiering policies.
Optimize metric cardinality, labeling standards, and cost-performance trade-offs at scale.

Job Benefits

At IREN, we offer a comprehensive Total Rewards package designed to support your health, well-being, and long-term success. Our Canada package includes:

Compensation

The expected base salary range for this role is CAD $135,000 to $150,000 per annum.
Actual compensation will be determined based on factors such as experience, qualifications, and market data for the region.
The total compensation package may include an annual incentive bonus and equity (long-term incentive).

Health & Wellness

Medical, dental, and vision insurance coverage – 100% company paid for employees and dependents
Company-paid life and disability insurance
Voluntary life and critical illness coverage available
Employee Assistance Program and virtual health care platform

Financial Well-Being

RRSP with company match
Voluntary TFSA

Time Off & Flexibility

3 weeks annually for vacation and paid holidays

Growth & Development

Opportunities for advancement and internal mobility
Training and personal development opportunities

Lifestyle & Culture

Company events and team-building activities

We value diverse perspectives and believe that skills can be developed. If you’re passionate about this role, we want to hear from you, whether or not you meet every criterion. Your unique experiences might be exactly what we need!

Podtech Data Centers Inc., the employing entity and proud member of the IREN Group, is an equal opportunity employer that is committed to creating an inclusive workplace. We evaluate qualified applicants without regard to race, colour, religion, age, sex, sexual orientation, gender identity, genetic information, national origin, disability, veteran status, or other legally protected characteristics.

This job will remain posted until filled. While we appreciate all applications we receive, we are only able to contact candidates under consideration.

By applying for this position and submitting your resume and application materials, you consent to the processing of your personal information in accordance with our Job Applicant Privacy Statement available on our website at www.iren.com.
