Datacenter Observability and Site Reliability Engineer
Macpower Digital Assets Edge Private Limited
Job Description
Job Summary: We are seeking a skilled Observability & Site Reliability Engineer to support large-scale, enterprise-grade infrastructure. The ideal candidate will have extensive experience with observability tools, especially Grafana, Loki, Mimir, and Kubernetes metrics and logs, along with a strong passion for performance, scalability, and system uptime. Candidates must be flexible enough to collaborate with Korean stakeholders and work within the Korean time zone.
- Experience: 8 to 12 years.
- Notice Period: Immediate to 30 days preferred.
Key Must-Have Skills:
- 5+ years in Observability Engineering.
- Expertise in Grafana, Loki, Mimir, and Alloy agent.
- Strong understanding of infrastructure metrics (e.g., GPU, CPU, Kubernetes).
- Proficiency in scripting languages (Python, Go, Bash).
- Prior exposure to tools such as Prometheus, ELK, Docker, and Terraform.
- Flexibility to work with Korean stakeholders and time zones.
Role Highlights:
- Design and manage the observability stack across large-scale data center infrastructure.
- Build scalable telemetry systems, dashboards, alerts, and reports.
- Apply SRE best practices to ensure system reliability and performance.
- Troubleshoot real-time issues and contribute to ongoing system optimization.
Good to Have:
- Previous experience working with Korean stakeholders.
- Familiarity with cloud platforms like AWS, GCP, or Azure.