Senior Software Engineer - SRE
PeopleTree Knowledge Services
Job Description
Title: Senior Software Engineer / SRE (Observability Focus)
Timings: 3:00 PM – 1:00 AM IST (Monday to Friday)
Work Mode: Remote
Role Summary
We are seeking a Senior Software Engineer / SRE (Observability Focus) to drive platform reliability, monitoring, and operational excellence. This role combines software engineering (60–70%) and site reliability engineering (30–40%), with a strong emphasis on Kubernetes-based environments and observability platforms. You will play a key role in owning and operating internal engineering platforms and in improving system reliability, scalability, and performance across cloud-native and microservices architectures.
The ideal candidate is proactive, takes end-to-end ownership, and drives continuous improvement rather than providing only reactive support.
What You’ll Do
- Design and develop automation tools, services, and integrations to improve platform reliability and operational efficiency
- Implement and manage observability solutions (metrics, logs, tracing, dashboards, alerts) using platforms like Datadog, Prometheus, and Grafana
- Own and operate internal observability and monitoring platforms, ensuring reliability, scalability, and performance
- Work with Kubernetes environments to deploy, monitor, and optimize containerized applications
- Integrate observability into CI/CD pipelines to improve deployment visibility and system health
- Collaborate with engineering teams to enhance APM practices and reliability engineering standards
- Automate monitoring configurations and operational workflows using Python and scripting
- Support cloud-based observability by integrating AWS services with monitoring platforms
- Provide operational and training support for observability platforms (e.g., Datadog) used by engineering teams
- Proactively identify system bottlenecks and lead initiatives to improve availability, scalability, and performance
Key Requirements (Must-Have Skills)
- Strong programming skills in at least one of the following: Python, JavaScript (Node.js), or Java
- Hands-on experience with Kubernetes (deployment, operations, monitoring)
- Strong experience with observability tools, especially Datadog (preferred), Prometheus, and Grafana
- Experience with API integrations and working with distributed systems
- Solid understanding of monitoring, logging, and distributed tracing concepts
- Experience with AWS cloud services and cloud-native architectures
- Experience integrating observability into CI/CD pipelines
- Strong automation skills using scripting and infrastructure tooling
- Demonstrated experience in owning production systems/platforms, ensuring reliability and performance
Strongly Preferred
- Experience operating or owning an internal engineering or observability platform
- Proven track record of improving system reliability, scalability, and performance proactively
- Experience managing Datadog agents, API keys, access controls, and platform configurations
- Ability to lead incident response, troubleshooting, and performance optimization efforts
- Experience working in cross-functional teams and enabling engineering teams with observability best practices
Nice-to-Have
- Experience with Go (Golang)
- Familiarity with tools like New Relic, Dynatrace, Elastic Observability, or Splunk
- Knowledge of security and access management best practices in observability platforms
- Experience working in distributed microservices environments at scale
About PeopleTree Knowledge Services
people-tree.com