Principal Site Reliability Engineer

As a Principal Site Reliability Engineer, you will join a high-performing engineering organization responsible for building, operating, and scaling large-scale distributed platforms and enterprise backend systems.

You will drive reliability, observability, automation, and operational excellence initiatives to ensure highly available, secure, and cost-efficient infrastructure.

You will play a strategic role in defining platform reliability standards, improving system resilience, and partnering closely with engineering and architecture teams to support scalable production environments.

Key Responsibilities

Design, implement, and maintain observability, monitoring, and alerting solutions for mission-critical platforms and backend services. Build and manage telemetry pipelines, centralized logging platforms, and operational dashboards using tools such as Splunk, Prometheus, Grafana, and OpenTelemetry.

Define and maintain Service Level Objectives (SLOs), Service Level Indicators (SLIs), and availability metrics across services and APIs.

Participate in on-call rotations and lead resolution of critical production incidents, including root cause analysis and post-incident reviews. Collaborate with platform and infrastructure teams to enforce governance, compliance, and security standards in production environments. Enhance deployment automation, CI/CD pipelines, and infrastructure provisioning workflows (e.g., GitLab).

Optimize and scale distributed infrastructure components including Kafka, HAProxy, RabbitMQ, databases, and API platforms.

Perform capacity planning, performance tuning, and cost optimization for large-scale environments. Champion automation-first operations by eliminating manual processes through scripting and reliability tooling. Develop and maintain operational documentation, runbooks, and knowledge repositories. Mentor engineers and promote a culture of reliability engineering, operational maturity, and continuous improvement.

Qualifications

Bachelor’s degree in Computer Science, Engineering, or related discipline (Master’s preferred). 15+ years of overall technology experience with 10+ years in SRE, DevOps, or Production Operations within cloud environments.

Proven experience managing monitoring, alerting, and incident response for distributed systems. Strong programming and scripting skills in Python, Java, Bash, or PowerShell.

Solid understanding of database architecture and distributed storage technologies such as Oracle, Cassandra, Solr, and Kafka.

Hands-on expertise with CI/CD pipelines and GitLab workflows. Strong experience with SQL and NoSQL databases. Advanced knowledge of Linux systems administration, networking fundamentals (DNS, TLS/SSL, load balancing), and large-scale troubleshooting.

Experience with Kubernetes, container orchestration, and hybrid or multi-cloud deployments (Azure preferred; AWS/GCP/OCI acceptable).

Deep understanding of enterprise security practices including authentication, authorization, encryption, SSH/SFTP, PKI, X.509 certificates, and PGP. Familiarity with ITIL practices and ServiceNow incident/problem management workflows.

Demonstrated ability to operate effectively in high-availability, incident-driven production environments.

Preferred Qualifications

Experience supporting large-scale distributed platforms with strict uptime requirements. Exposure to advanced monitoring analytics and operational intelligence practices.

Experience working in regulated enterprise or telecom environments with strong compliance and audit controls. Understanding of secure API architectures and enterprise integration patterns.

Experience designing zero-downtime deployment strategies and high-availability platforms.

Knowledge, Skills & Abilities

Deep understanding of Site Reliability Engineering practices including SLOs, SLIs, incident management, postmortems, and resilience engineering.

Strong ability to diagnose performance and reliability issues across infrastructure, application, and network layers. Expertise in automation across observability, configuration management, and deployment workflows. Excellent collaboration and communication skills across engineering, platform, and operations teams. Continuous improvement mindset with strong ownership of platform stability and operational excellence. Passion for proactive monitoring, anomaly detection, and reliability automation.
