Job Description
IREN is a leading AI Cloud Service Provider, delivering large-scale GPU clusters for AI training and inference. IREN’s vertically integrated platform is underpinned by its expansive portfolio of grid-connected land and data centers in renewable-rich regions across the U.S. and Canada.
We build, own and operate our data centers, powered by 100% renewable energy, and take pride in being at the forefront of sustainable solutions for the ever-evolving applications of high-performance compute. We believe that human progress is invaluable, but it should be done in the right way: responsibly, sustainably and with a positive impact on the communities we operate in.
The Platform Infrastructure Division builds and operates the foundational systems that power IREN’s GPU-enabled, multi-tenant compute platform. We are hiring a Senior Service Reliability Engineer to lead observability architecture and reliability engineering across large-scale distributed systems.
This role owns the design, scalability, and operational excellence of the observability platform. You will transform high-volume metrics, logs, events, and traces into actionable intelligence that improves reliability, performance, and operational efficiency.
You will be on the bleeding edge of the AI Revolution, building monitoring systems for thousands of GPUs and integrating reliability principles into the heart of our operations.
Job Requirements
Technical Skills
- 7+ years in Site Reliability Engineering, DevOps, Infrastructure Engineering, or similar roles.
- 3+ years owning observability platforms at meaningful scale.
- Deep understanding of distributed systems and production operations.
- Strong hands-on experience with Prometheus and Grafana in large-scale environments.
- Experience with tracing and logging ecosystems including OpenTelemetry, Jaeger, Tempo, Loki, or Elasticsearch.
- Strong Linux systems engineering background including performance analysis and troubleshooting.
- Experience operating Kubernetes in production environments.
- Strong networking fundamentals including TCP/IP, DNS, and service-to-service communication patterns.
- Proficiency in Go, Python, or similar modern programming languages.
- Experience building automation and internal reliability tooling.
- Experience managing high-volume telemetry ingestion and time-series storage systems.
Soft Skills & Competencies
- Strong analytical and troubleshooting capabilities across complex distributed systems.
- Ownership mindset with a strong sense of responsibility toward production systems.
- Effective communicator able to collaborate across engineering, infrastructure, and leadership teams.
- Pragmatic problem solver focused on reliability, scalability, and operational excellence.
Nice-to-Have
- Experience operating GPU-dense environments or high-performance compute clusters.
- Experience integrating GPU telemetry and hardware health signals into observability systems.
- Familiarity with InfiniBand, RoCE, or advanced data center networking fabrics.
- Experience integrating out-of-band management telemetry such as Redfish or BMC event streams.
- Experience supporting AI training infrastructure or research compute environments.
Job Responsibilities
Observability Architecture & Telemetry Strategy
- Design and own end-to-end observability architecture across metrics, logs, traces, and event streams.
- Define telemetry standards and enforce consistent metadata across services and infrastructure domains.
- Establish and operationalize service level indicators and service level objectives across critical systems.
- Implement error-budget-driven alerting strategies that prioritize signal over noise.
- Architect highly available, scalable, and cost-efficient telemetry ingestion and storage systems.
- Develop executive- and engineering-level dashboards that surface reliability posture and system health trends.
Incident Management & Operational Excellence
- Own and evolve the full incident lifecycle across detection, triage, mitigation, resolution, and recovery.
- Design severity models, escalation paths, and response playbooks across software and infrastructure domains.
- Lead complex cross-functional incident response efforts involving distributed systems and GPU infrastructure.
- Conduct structured, blameless post-incident reviews and drive long-term systemic improvements.
- Track and improve key operational metrics including mean time to detect, mean time to recover, and change failure rate.
- Partner with engineering teams to eliminate recurring incidents through automation and architectural improvements.
Software & Infrastructure Observability
- Partner with engineering teams to standardize instrumentation across applications and services.
- Drive adoption of distributed tracing and structured logging best practices.
- Build investigative workflows that connect application-level symptoms to infrastructure and hardware signals.
- Correlate GPU health events and hardware