Site Reliability Engineer (SRE)

Bhuvitech Innovations

Full Time / Mid / Hybrid

Karnataka, IN

Posted 5 days ago

Job Description

Senior Site Reliability Engineer (SRE)

Job Snapshot

Experience: 6-10 Years

Location: Bengaluru / Bangalore

Employment Type: Full-Time, Permanent

Work Mode: Hybrid

Education: B.E. / B.Tech in Computer Science, IT, or equivalent

No. of Openings: 3

Salary: As per industry standards (Not disclosed)

Industry: IT - Software / Technology

Functional Area: Engineering - DevOps / SRE / Cloud Infrastructure

Role: Senior Site Reliability Engineer

Key Skills

Python, SRE, Site Reliability Engineering, ELK Stack, Elasticsearch, Kibana, AWS, GCP, Google Cloud Platform, Kubernetes, GKE, EKS, Docker, Helm, Prometheus, Grafana, OpenTelemetry, Alertmanager, Loki, Datadog, Splunk, Cribl, Vector, Terraform, Ansible, Packer, Jenkins, CI/CD, Linux, Observability, Monitoring, SLO, SLI, Incident Management, Infrastructure as Code, Cloud Monitoring, Microservices, Distributed Systems

Job Description

We are looking for a Senior Site Reliability Engineer (SRE) with deep expertise in observability, cloud-native infrastructure, and large-scale distributed systems. This is a hands-on role focused on designing, building, and operating reliable, observable, and scalable platforms running on Kubernetes, with a primary focus on GCP and AWS.

What You Will Work On:

Build and operate a centralized observability platform for metrics, logs, traces, and alerting across Kubernetes workloads using Prometheus, Grafana, OpenTelemetry, and GCP Cloud Monitoring.
Define and drive SLOs, SLIs, and error budgets to improve reliability, reduce MTTR, and guide release decisions.
Design, operate, and optimize EKS/GKE-based Kubernetes platforms using Helm and containerized workloads with Docker.
Develop Python-based automation and tooling for observability, SLO reporting, incident response, and operational workflows.
Lead incident response for production issues, conduct blameless postmortems, and drive long-term reliability improvements.
Optimize platform scalability, performance, and cloud cost efficiency with a strong focus on GCP and AWS.
Act as a technical leader influencing architecture and mentoring teams on reliability and observability best practices.

Roles & Responsibilities

Reliability & Operations:

Design, implement, and maintain highly available and resilient systems in Kubernetes-based production environments.
Define and enforce SLOs, SLIs, and error budgets across critical services.
Lead incident response, root cause analysis (RCA), and blameless postmortems.
Eliminate toil through automation, improving MTTR and overall system uptime.

Observability (Core Focus):

Architect and operate observability platforms for metrics, logging, distributed tracing, and alerting at scale.
Work hands-on with Prometheus, Alertmanager, OpenTelemetry, Grafana, Loki, ELK Stack, and OpenSearch.
Implement cloud-native monitoring using GCP Cloud Monitoring & Logging and AWS CloudWatch.
Establish actionable, low-noise alerting standards with runbooks and clear escalation paths.

Cloud & Platform Engineering:

Build and manage production infrastructure on GCP (preferred) and/or AWS using Infrastructure as Code.
Operate and optimize Kubernetes clusters on GKE (preferred) and EKS.
Deploy and manage services using Helm charts.
Manage containerized workloads with Docker.
Provision infrastructure using Terraform, Ansible, and Packer.

Automation & Tooling:

Develop automation and internal tooling using Python for reliability, monitoring, and ops workflows.
Build custom tools for SLO tracking, capacity planning, and incident management.
Integrate CI/CD pipelines (Jenkins or equivalent) with observability and reliability checks.
Implement log routing and management using Cribl and Vector.

Collaboration & Leadership:

Mentor junior engineers on SRE practices and observability.
Influence architecture decisions with a reliability-first mindset.
Collaborate across engineering, QA, and product teams to embed SRE culture.

Desired Candidate Profile

Mandatory Skills:

Python: Strong proficiency in scripting, automation, and tooling development.
Site Reliability Engineering: Hands-on experience with SLOs, SLIs, error budgets, incident management, and