Site Reliability Engineer (SRE)
Bhuvitech Innovations
Job Description
Senior Site Reliability Engineer (SRE)
Job Snapshot
Experience: 6-10 Years
Location: Bengaluru / Bangalore
Employment Type: Full-Time, Permanent
Work Mode: Hybrid
Education: B.E. / B.Tech in Computer Science, IT, or equivalent
No. of Openings: 3
Salary: As per industry standards (Not disclosed)
Industry: IT - Software / Technology
Functional Area: Engineering - DevOps / SRE / Cloud Infrastructure
Role: Senior Site Reliability Engineer
Key Skills
Python, SRE, Site Reliability Engineering, ELK Stack, Elasticsearch, Kibana, AWS, GCP, Google Cloud Platform, Kubernetes, GKE, EKS, Docker, Helm, Prometheus, Grafana, OpenTelemetry, Alertmanager, Loki, Datadog, Splunk, Cribl, Vector, Terraform, Ansible, Packer, Jenkins, CI/CD, Linux, Observability, Monitoring, SLO, SLI, Incident Management, Infrastructure as Code, Cloud Monitoring, Microservices, Distributed Systems
Job Description
We are looking for a Senior Site Reliability Engineer (SRE) with deep expertise in observability, cloud-native infrastructure, and large-scale distributed systems. This is a highly hands-on role focused on designing, building, and operating reliable, observable, and scalable platforms running on Kubernetes, with a primary focus on GCP and AWS.
What You Will Work On:
- Build and operate a centralized observability platform for metrics, logs, traces, and alerting across Kubernetes workloads using Prometheus, Grafana, OpenTelemetry, and GCP Cloud Monitoring.
- Define and drive SLOs, SLIs, and error budgets to improve reliability, reduce MTTR, and guide release decisions.
- Design, operate, and optimize EKS/GKE-based Kubernetes platforms using Helm and containerized workloads with Docker.
- Develop Python-based automation and tooling for observability, SLO reporting, incident response, and operational workflows.
- Lead incident response for production issues, conduct blameless postmortems, and drive long-term reliability improvements.
- Optimize platform scalability, performance, and cloud cost efficiency with a strong focus on GCP and AWS.
- Act as a technical leader influencing architecture and mentoring teams on reliability and observability best practices.
Roles & Responsibilities
Reliability & Operations:
- Design, implement, and maintain highly available and resilient systems in Kubernetes-based production environments.
- Define and enforce SLOs, SLIs, and error budgets across critical services.
- Lead incident response, root cause analysis (RCA), and blameless postmortems.
- Eliminate toil through automation, improving MTTR and overall system uptime.
Observability (Core Focus):
- Architect and operate observability platforms for metrics, logging, distributed tracing, and alerting at scale.
- Work hands-on with Prometheus, Alertmanager, OpenTelemetry, Grafana, Loki, ELK Stack, and OpenSearch.
- Implement cloud-native monitoring using GCP Cloud Monitoring & Logging and AWS CloudWatch.
- Establish actionable, low-noise alerting standards with runbooks and clear escalation paths.
Cloud & Platform Engineering:
- Build and manage production infrastructure on GCP (preferred) and/or AWS using Infrastructure as Code.
- Operate and optimize Kubernetes clusters on GKE (preferred) and EKS.
- Deploy and manage services using Helm charts.
- Manage containerized workloads with Docker.
- Provision infrastructure using Terraform, Ansible, and Packer.
Automation & Tooling:
- Develop automation and internal tooling using Python for reliability, monitoring, and ops workflows.
- Build custom tools for SLO tracking, capacity planning, and incident management.
- Integrate CI/CD pipelines (Jenkins or equivalent) with observability and reliability checks.
- Implement log routing and management using Cribl and Vector.
Collaboration & Leadership:
- Mentor junior engineers on SRE practices and observability.
- Influence architecture decisions with a reliability-first mindset.
- Collaborate across engineering, QA, and product teams to embed SRE culture.
Desired Candidate Profile
Mandatory Skills:
- Python: Strong proficiency in scripting, automation, and tooling development.
- Site Reliability Engineering: Hands-on experience with SLOs, SLIs, error budgets, incident management, and