Site Reliability Engineer
Humankind Global Recruitment
Job Description
Our client, a dynamic Information Technology services company that partners with leading global organizations to deliver innovative, high-quality IT solutions, is looking for a Site Reliability Engineer.
As a Site Reliability Engineer (SRE), you will be responsible for ensuring the reliability, availability, and performance of our mission-critical applications and systems. You will be part of a team that bridges the gap between development and operations, leveraging your deep technical expertise and problem-solving skills to implement best practices in infrastructure automation, monitoring, scaling, and incident response.
As a member of the team, you will mentor and guide more junior SREs, work with cross-functional teams, and drive improvements to systems and processes. A passion for building highly resilient, scalable, and efficient systems is key to success in this role.
The SRE will play a key role in helping Team Leads and Senior SREs bridge the gap between the organization as a customer and the team as a service provider. The SRE is expected to lead, mentor, and inspire people while applying strong technical knowledge to troubleshoot and improve systems.
Responsibilities
Reliability & Availability:
- Maintain and improve system reliability, uptime, and performance across production environments.
- Set and track service-level objectives (SLOs), service-level indicators (SLIs), and service-level agreements (SLAs).
- Drive improvements in incident response processes, ensuring systems are fault-tolerant and highly available.
Automation & Infrastructure:
- Design, implement, and maintain automation tools to deploy and manage infrastructure at scale.
- Collaborate with software engineering teams to integrate reliability practices into CI/CD pipelines.
- Improve the scalability, resilience, and efficiency of cloud infrastructure.
Monitoring & Observability:
- Implement and maintain monitoring systems and alerts to ensure proactive identification of issues.
- Define key performance metrics and implement logging, monitoring, and alerting solutions across all services and platforms.
Incident Management & Root Cause Analysis:
- Lead and participate in incident response efforts, performing root cause analysis and post-mortems to prevent recurrence.
- Champion a culture of blameless post-mortems and continuously improve incident response playbooks.
- Provide technical leadership and mentorship to junior SREs and other team members.
- Work closely with engineering teams to ensure reliability is a key consideration in application design and development.
- Foster a culture of collaboration between development, operations, and SRE teams to ensure continuous improvement in service reliability.
- Advocate for and implement changes to improve performance, reduce toil, and optimize resource utilization.
- Drive the evolution of operational tooling and processes to enhance the quality of service provided to customers.
Qualifications
7+ years' experience working as a Site Reliability Engineer is required.
Infrastructure Automation & Configuration Management:
- Proficiency in infrastructure-as-code (IaC) and automation tools such as Terraform, Ansible, and AWX.
- Knowledge of the KVM hypervisor.
- Experience with containerization technologies like Docker and Kubernetes.
- Knowledge of cloud platforms such as AWS (especially S3), Google Cloud, or Azure is a plus.
Monitoring & Observability:
- Hands-on experience with monitoring tools such as Zabbix, Prometheus, Grafana, etc.
- Strong understanding of logging and tracing technologies (e.g., ELK stack, Fluentd, OpenTelemetry).
- Experience with Redis and RabbitMQ.
Distributed Systems & Networking:
- Solid understanding of distributed system design principles (CAP theorem, eventual consistency, etc.).
- Familiarity with network protocols and debugging tools (e.g., TCP/IP, HTTP/HTTPS, DNS, SMTP, load balancing).
- Solid knowledge of HTTP/HTTPS-related services such as Apache HTTP Server, HAProxy, and HTTP/HTTPS proxies, as well as mail services (Postfix, StrongMail).
- Familiarity with distributed concepts such as GeoDNS, CLB (Cloud Load Balancing), etc.
- Deep knowledge of operational security (service hardening, WAF, honeypots, SIEMs).
- Mastery of IPv4 fundamentals (including basic knowledge of routing protocols such as BGP and OSPF). IPv6 experience is a plus.
Incident Management & Root Cause Analysis:
- Proven ability to lead post-incident reviews and write detailed post-mortems.
- Experience with incident management tools.
CI/CD & DevOps Practices:
- Experience with CI/