Senior Software Engineer - SRE

PeopleTree Knowledge Services

Full Time, Senior

Rajkot, Gujarat, IN

Posted 4 days ago

Job Description

Title: Senior Software Engineer / SRE (Observability Focus)

Timings: 3:00 PM – 1:00 AM IST (Monday to Friday)

Work Mode: Remote

Role Summary

We are seeking a Senior Software Engineer / SRE (Observability Focus) to drive platform reliability, monitoring, and operational excellence. This role combines software engineering (60–70%) and site reliability engineering (30–40%), with a strong emphasis on Kubernetes-based environments and observability platforms. You will play a key role in owning and operating internal engineering platforms and in improving system reliability, scalability, and performance across cloud-native and microservices architectures.

The ideal candidate is proactive, takes end-to-end ownership, and drives continuous improvement rather than providing only reactive support.

What You’ll Do

Design and develop automation tools, services, and integrations to improve platform reliability and operational efficiency
Implement and manage observability solutions (metrics, logs, tracing, dashboards, alerts) using platforms like Datadog, Prometheus, and Grafana
Own and operate internal observability and monitoring platforms, ensuring reliability, scalability, and performance
Work with Kubernetes environments to deploy, monitor, and optimize containerized applications
Integrate observability into CI/CD pipelines to improve deployment visibility and system health
Collaborate with engineering teams to enhance APM practices and reliability engineering standards
Automate monitoring configurations and operational workflows using Python and scripting
Support cloud-based observability by integrating AWS services with monitoring platforms
Provide operational and training support for observability platforms (e.g., Datadog) used by engineering teams
Proactively identify system bottlenecks and lead initiatives to improve availability, scalability, and performance

Key Requirements (Must-Have Skills)

Strong programming skills in at least one of the following: Python, JavaScript (Node.js), or Java
Hands-on experience with Kubernetes (deployment, operations, monitoring)
Strong experience with observability tools, especially Datadog (preferred), Prometheus, and Grafana
Experience with API integrations and working with distributed systems
Solid understanding of monitoring, logging, and distributed tracing concepts
Experience with AWS cloud services and cloud-native architectures
Experience integrating observability into CI/CD pipelines
Strong automation skills using scripting and infrastructure tooling
Demonstrated experience in owning production systems/platforms, ensuring reliability and performance

Strongly Preferred

Experience operating or owning an internal engineering or observability platform
Proven track record of improving system reliability, scalability, and performance proactively
Experience managing Datadog agents, API keys, access controls, and platform configurations
Ability to lead incident response, troubleshooting, and performance optimization efforts
Experience working in cross-functional teams and enabling engineering teams with observability best practices

Nice-to-Have

Experience with Go (Golang)
Familiarity with tools like New Relic, Dynatrace, Elastic Observability, or Splunk
Knowledge of security and access management best practices in observability platforms
Experience working in distributed microservices environments at scale