Lead Site Reliability Engineer, fully remote

Concentrix

Full Time, Lead

Thiruvananthapuram, Kerala, IN

Posted April 22, 2026

Resume Keywords to Include

Make sure these keywords appear in your resume to improve ATS scoring

Python, Go, Bash, AWS, GCP, Azure, Kubernetes, CI/CD

Job Description

About the Role: As a Lead Site Reliability Engineer, you will own the reliability and availability of our production systems. You will champion SRE principles across engineering teams: defining SLOs, managing error budgets, and leading a culture of blameless incident response. This is a hands-on leadership role in which you will partner closely with product and engineering teams to balance the pace of innovation with the stability our customers depend on.

Title: Site Reliability Engineer

Shift: General/UK Shift

Location: India, remote (any location near CNX offices)

· Use error budget policies to drive data-informed conversations between engineering and product on release velocity trade-offs.
· Conduct capacity planning and proactive risk assessments to prevent incidents before they occur.

Incident Management
· Lead incident response as incident commander — coordinating teams, driving resolution, and maintaining clear stakeholder communication during outages.
· Develop and continuously improve runbooks, escalation paths, and on-call practices to reduce MTTD and MTTR.
Observability & Monitoring
· Design and maintain observability strategies using modern tooling (Prometheus, Grafana, OpenTelemetry, ELK) to ensure full visibility into system health.
· Identify and measure toil across the engineering organization and lead initiatives to eliminate it through automation.
· Collaborate with platform and infrastructure teams on cloud-native patterns for fault tolerance, auto-scaling, and disaster recovery.
· Provide SRE input into CI/CD pipelines and deployment strategies (e.g., canary releases, blue/green deployments) to minimize production risk.
· Act as an SRE advocate across engineering, embedding reliability thinking into the software development lifecycle.
AI Expectations
As with all engineers at our organization, this role requires an AI-native mindset. Embed AI tools and practices into how we build and run our platform — deploying AI-powered capabilities and shipping real AI features into production.
· Support engagement and solutioning for AI-powered offerings, translating technical capabilities into tangible business value.
· Collaborate with cross-functional partners — including Product, Data, Security, and Legal — to ensure AI is delivered safely, effectively, and in compliance with relevant standards.
7+ years of experience in SRE, platform engineering, or a related discipline.
Proven experience defining and managing SLOs, SLIs, and error budgets in a production environment.
Strong incident management experience, including leading postmortems and driving reliability improvements.
Hands-on experience with observability tooling (Prometheus, Grafana, OpenTelemetry, or similar).
Solid understanding of cloud platforms (AWS, Azure, or GCP) and containerized environments (Kubernetes).
Proficiency in at least one scripting or programming language (Python, Go, or Bash).
Experience with chaos engineering tools.
Experience with GitOps workflows and CI/CD pipelines.
Bilingual proficiency (English & Spanish).
Complete all assigned, mandatory training within the timeframe provided.