Principal Site Reliability Engineer (SRE)

Trust3 AI
Full Time | Principal
IN | Posted April 15, 2026

Job Description

Role Overview:

As a Principal Site Reliability Engineer (SRE) focusing on Data Platforms, you will own the reliability, support, and operations of enterprise data platforms such as Trust3 AI, Snowflake, and Databricks, with a primary emphasis on Google Cloud Platform (GCP). The role is deeply hands-on and combines managed services ownership, advanced production engineering, and reliability at scale.

Key Responsibilities:

  • Own the end-to-end platform lifecycle and delivery of managed services, including installation, operations, upgrades, optimization, and continuous platform health
  • Take complete ownership of critical production incidents by conducting deep debugging, Root Cause Analysis (RCA), and implementing permanent fixes
  • Troubleshoot complex, cross-system issues across GCP (GKE, IAM, networking), data platforms, and connectors
  • Lead performance tuning, scalability optimization, and system hardening for high-throughput systems
  • Design and implement automation across deployments, monitoring, and operations
  • Manage secrets and secure integrations using Vault (or similar) within the platform and CI/CD workflows
  • Install, upgrade, and operate Trust3 AI on GCP (GKE) across multi-region environments
  • Ensure accurate and reliable enforcement of data access policies
  • Build and enhance observability (metrics, logs, alerts) for proactive issue detection
  • Eliminate operational toil through continuous reliability improvements
  • Own issues end-to-end with strong stakeholder communication and adherence to SLAs
  • Collaborate with Engineering and Product teams to resolve issues and influence platform improvements
  • Lead managed services operations, including monitoring, incident prevention, capacity planning, DR readiness, and delivery of service-level outcomes (SLA, uptime, upgrade timelines)

Qualifications Required:

  • Cloud expertise in GCP (GKE, IAM, BigQuery, GCS, VPC, Cloud Monitoring/Logging); exposure to AWS/Azure is a plus
  • Familiarity with data platforms such as Snowflake, Databricks, and BigQuery
  • Experience with infrastructure and CI/CD tooling: Kubernetes, Helm, a CI/CD system (GitHub Actions, GitLab CI, or similar), and Terraform (preferred)
  • Proficiency in scripting languages like Python and Bash
  • Knowledge of observability tools like Prometheus, Grafana, and ELK
  • Understanding of security concepts including IAM, RBAC/ABAC, data governance (Trust3 AI/Ranger preferred), and secrets management (Vault or similar)

Additional Company Details:

This role is production-oriented, focusing on end-to-end incident ownership, deep technical problem-solving, and managed services operations. It does not primarily involve DevOps/build-only tasks or people management responsibilities.
