Job Description
Principal Site Reliability Engineer
As a Principal Site Reliability Engineer, you will join a high-performing engineering organization responsible for building, operating, and scaling large-scale distributed platforms and enterprise backend systems. You will drive reliability, observability, automation, and operational excellence initiatives to ensure highly available, secure, and cost-efficient infrastructure. You will play a strategic role in defining platform reliability standards, improving system resilience, and partnering closely with engineering and architecture teams to support scalable production environments.
Key Responsibilities
- Design, implement, and maintain observability, monitoring, and alerting solutions for mission-critical platforms and backend services.
- Build and manage telemetry pipelines, centralized logging platforms, and operational dashboards using tools such as Splunk, Prometheus, Grafana, and OpenTelemetry.
- Define and maintain Service Level Objectives (SLOs), Service Level Indicators (SLIs), and availability metrics across services and APIs.
- Participate in on-call rotations and lead resolution of critical production incidents, including root cause analysis and post-incident reviews.
- Collaborate with platform and infrastructure teams to enforce governance, compliance, and security standards in production environments.
- Enhance deployment automation, CI/CD pipelines, and infrastructure provisioning workflows (e.g., GitLab).
- Optimize and scale distributed infrastructure components including Kafka, HAProxy, RabbitMQ, databases, and API platforms.
- Perform capacity planning, performance tuning, and cost optimization for large-scale environments.
- Champion automation-first operations by eliminating manual processes through scripting and reliability tooling.
- Develop and maintain operational documentation, runbooks, and knowledge repositories.
- Mentor engineers and promote a culture of reliability engineering, operational maturity, and continuous improvement.
Qualifications
- Bachelor’s degree in Computer Science, Engineering, or related discipline (Master’s preferred).
- 15+ years of overall technology experience with 10+ years in SRE, DevOps, or Production Operations within cloud environments.
- Proven experience managing monitoring, alerting, and incident response for distributed systems.
- Strong programming and scripting skills in Python, Java, Bash, or PowerShell.
- Solid understanding of database architecture and distributed storage technologies such as Oracle, Cassandra, Solr, and Kafka.
- Hands-on expertise with CI/CD pipelines and GitLab workflows.
- Strong experience with SQL and NoSQL databases.
- Advanced knowledge of Linux systems administration, networking fundamentals (DNS, TLS/SSL, load balancing), and large-scale troubleshooting.
- Experience with Kubernetes, container orchestration, and hybrid or multi-cloud deployments (Azure preferred; AWS/GCP/OCI acceptable).
- Deep understanding of enterprise security practices including authentication, authorization, encryption, SSH/SFTP, PKI, X.509 certificates, and PGP.
- Familiarity with ITIL practices and ServiceNow incident/problem management workflows.
- Demonstrated ability to operate effectively in high-availability, incident-driven production environments.
Preferred Qualifications
- Experience supporting large-scale distributed platforms with strict uptime requirements.
- Exposure to advanced monitoring analytics and operational intelligence practices.
- Experience working in regulated enterprise or telecom environments with strong compliance and audit controls.
- Understanding of secure API architectures and enterprise integration patterns.
- Experience designing zero-downtime deployment strategies and high-availability platforms.
Knowledge, Skills & Abilities
- Deep understanding of Site Reliability Engineering practices including SLOs, SLIs, incident management, postmortems, and resilience engineering.
- Strong ability to diagnose performance and reliability issues across infrastructure, application, and network layers.
- Expertise in automation across observability, configuration management, and deployment workflows.
- Excellent collaboration and communication skills across engineering, platform, and operations teams.
- Continuous improvement mindset with strong ownership of platform stability and operational excellence.
- Passion for proactive monitoring, anomaly detection, and reliability automation.