Datacenter Observability and Site Reliability Engineer | Chennai, IN
Macpower Digital Assets Edge
Job Description
Datacenter Observability and Site Reliability Engineer
Job Summary: We are seeking a skilled Observability & Site Reliability Engineer to join our team in supporting large-scale, enterprise-grade infrastructure. The ideal candidate will have extensive experience with observability tools, especially Grafana, Loki, Mimir, and Kubernetes metrics/logs, along with a strong passion for performance, scalability, and system uptime. Candidates must be flexible to collaborate with Korean stakeholders and work within the Korean time zone.
- Experience: 8 to 12 years.
- Notice Period: Immediate to 30 days preferred.
Key Must-Have Skills:
- 5+ years in Observability Engineering.
- Expertise in Grafana, Loki, Mimir, and Alloy agent.
- Strong understanding of infrastructure metrics (e.g., GPU, CPU, Kubernetes).
- Proficiency in scripting languages (Python, Go, Bash).
- Prior exposure to tools such as Prometheus, ELK, Docker, and Terraform.
- Flexibility to work with Korean stakeholders and time zones.
Role Highlights:
- Design and manage the observability stack across large-scale data center infrastructure.
- Build scalable telemetry systems, dashboards, alerts, and reports.
- Apply SRE best practices to ensure system reliability and performance.
- Troubleshoot real-time issues and contribute to ongoing system optimization.
Good to Have:
- Previous experience working with Korean stakeholders.
- Familiarity with cloud platforms like AWS, GCP, or Azure.