Lead Site Reliability Engineer, fully remote
Concentrix
Job Description
About the Role: As a Lead Site Reliability Engineer, you will own the reliability and availability of our production systems. You will champion SRE principles across engineering teams: defining SLOs, managing error budgets, and leading a culture of blameless incident response. This is a hands-on leadership role in which you will partner closely with product and engineering teams to balance the pace of innovation with the stability our customers depend on.
Title: Site Reliability Engineer
Shift: General/UK Shift
Location: India, Remote (any location near CNX offices)
- Use error budget policies to drive data-informed conversations between engineering and product on release velocity vs. …
- Conduct capacity planning and proactive risk assessments to prevent incidents before they occur.
- Incident Management
- Lead incident response as incident commander: coordinating teams, driving resolution, and maintaining clear stakeholder communication during outages.
- Develop and continuously improve runbooks, escalation paths, and on-call practices to reduce MTTD and MTTR.
- Observability & Monitoring
- Design and maintain observability strategies using modern tooling (Prometheus, Grafana, OpenTelemetry, ELK) to ensure full visibility into system health.
- Identify and measure toil across the engineering organization and lead initiatives to eliminate it through automation.
- Collaborate with platform and infrastructure teams on cloud-native patterns for fault tolerance, auto-scaling, and disaster recovery.
- Provide SRE input into CI/CD pipelines and deployment strategies (e.g., canary releases, blue/green deployments) to minimize production risk.
- Act as an SRE advocate across engineering — embedding reliability thinking into the software development lifecycle.
- AI Expectations
- As with all engineers at our organization, this role requires an AI-native mindset. Embed AI tools and practices into how we build and run our platform — deploying AI-powered capabilities and shipping real AI features into production.
- Support engagement and solutioning for AI-powered offerings, translating technical capabilities into tangible business value.
- Collaborate with cross-functional partners (including Product, Data, Security, and Legal) to ensure AI is delivered safely, effectively, and in compliance with relevant standards.
- 7+ years of experience in SRE, platform engineering, or a related discipline.
- Proven experience defining and managing SLOs, SLIs, and error budgets in a production environment.
- Strong incident management experience, including leading postmortems and driving reliability improvements.
- Hands-on experience with observability tooling (Prometheus, Grafana, OpenTelemetry, or similar).
- Solid understanding of cloud platforms (AWS, Azure, or GCP) and containerized environments (Kubernetes).
- Proficiency in at least one scripting or programming language (Python, Go, or Bash).
- Experience with chaos engineering tools (e.g., …).
- Experience with GitOps workflows and CI/CD pipelines.
- Bilingual proficiency (English & Spanish).
- Complete all assigned, mandatory training within the timeframe provided.