Principal Site Reliability Engineer (SRE)
Job Description
Privacera
3 hours ago
Expires on: 17 May 2026
Pune City, Maharashtra, India
Job description & requirements
Role: Principal Site Reliability Engineer (SRE) – Data Platforms
Role Summary
Own reliability, support, and operations of enterprise data platforms (Trust3 AI, Snowflake, Databricks)
with a primary focus on Google Cloud Platform (GCP). This is a deeply hands-on Principal SRE role
combining managed services ownership, advanced production engineering, and reliability at scale.
What You’ll Do
- Own end-to-end platform lifecycle and managed services delivery: installation, operations, upgrades, optimization, and continuous platform health
- Take full ownership of critical production incidents with deep debugging, RCA, and permanent fixes
- Troubleshoot complex, cross-system issues across GCP (GKE, IAM, networking), data platforms, and connectors
- Lead performance tuning, scalability optimization, and system hardening for high-throughput systems
- Design and implement automation across deployments, monitoring, and operations
- Manage secrets and secure integrations using Vault (or similar) within platform and CI/CD workflows
- Install, upgrade, and operate Trust3 AI on GCP (GKE) across multi-region environments
- Ensure accurate and reliable enforcement of data access policies
- Build and enhance observability (metrics, logs, alerts) for proactive issue detection
- Eliminate operational toil through continuous reliability improvements
- Own issues end-to-end with strong stakeholder communication and SLA adherence
- Collaborate with Engineering and Product to resolve issues and influence platform improvements
- Lead managed services operations including monitoring, incident prevention, capacity planning, DR readiness, and service-level outcomes (SLA, uptime, upgrade timelines)
Skills Required
- Cloud: Strong expertise in GCP (GKE, IAM, BigQuery, GCS, VPC, Cloud Monitoring/Logging); AWS/Azure exposure is a plus
- Data Platforms: Snowflake, Databricks, BigQuery
- Infra & CI/CD: Kubernetes, Helm, CI/CD (GitHub Actions, GitLab CI, or similar), Terraform (preferred)
- Scripting: Python / Bash
- Observability: Prometheus, Grafana, ELK
- Security: IAM, RBAC/ABAC, data governance (Trust3 AI/Ranger preferred), secrets management (Vault or similar)
Experience
- 10+ years in SRE / DevOps / Production Engineering
- Strong expertise in debugging distributed systems and complex production environments
- Proven ownership of high-severity incidents and large-scale production systems
- Demonstrated ability to independently solve ambiguous, high-impact technical problems
- Track record of driving reliability, automation, and operational excellence at scale
- Experience running high-throughput, always-on (24x7) systems with large data volumes and strict uptime SLAs
Why This Role
- Principal-level, deeply hands-on IC role (no people management)
- End-to-end ownership of mission-critical data platforms
- Work on complex production challenges across cloud, data, and security layers
- High impact on enterprise data access, governance, and reliability
Important Note
This is a production-first role involving end-to-end incident ownership, deep technical problem solving,
and managed services operations. It is not a pure DevOps/build-only role or a people management role.