Staff Site Reliability Engineer

BuildOps
Full Time, Staff
San Francisco, California, US
Posted February 19, 2026

Job Description

At BuildOps, we’re building a software platform that empowers today’s commercial contractors. From service management to project execution, we’re reimagining how our customers operate. Our team thrives on ambition, innovation, and collaboration – qualities we look for in every new hire.

You will join our cloud infrastructure and reliability engineering team as a Site Reliability Engineer (SRE). Your primary responsibility will be to improve and protect the reliability, performance, and operability of our production systems while helping evolve our AWS-based infrastructure. We’re looking for someone with a strong SRE mindset, solid software engineering fundamentals, and deep observability expertise who can work effectively in a distributed team environment.

Reporting to the DevOps and SRE Manager, this is a hands-on, staff-level role where you will influence reliability strategy, build tooling and automation, and contribute directly to day-to-day operations in a fast-moving, industry-defining company.

What You’ll Do

  • Own one or more reliability domains end-to-end (for example observability, incident management workflows, performance of key surfaces, or core platform readiness), including strategy, roadmap, and execution
  • Drive and refine modern SRE practices across services, including SLIs/SLOs, error budgets, and reliability reviews
  • Lead multi-sprint, multi-engineer reliability or performance initiatives, coordinating work across teams and driving them to successful completion
  • Design and maintain end-to-end observability (metrics, logs, traces, dashboards, and alerts) so teams can quickly detect, debug, and prevent issues
  • Act as a subject-matter expert in at least one reliability area (for example observability, incident management, performance engineering, or search/data platforms), helping other teams make good design and operational decisions
  • Partner with product and engineering teams to design reliable services—reviewing architectures, failure modes, rollout strategies, and capacity/latency considerations—and influence system design toward reliability and performance goals
  • Help evolve and operate our AWS infrastructure (networking, compute, data stores) in collaboration with infrastructure experts, working within Infrastructure as Code workflows
  • Contribute code to services, tooling, and automation (for example reliability libraries, deployment and incident tooling, health checks), and use LLMs/AI tools to accelerate high-quality delivery
  • Define, implement, and iterate on SLIs, SLOs, and error budgets with service owners, and use them to guide reliability work and release decisions
  • Participate in production on-call rotations and incident response for high-severity issues, including learning-focused post-incident reviews and follow-through on action items
  • Develop runbooks, safeguards, and automation that reduce manual work, improve time-to-diagnosis, and standardize responses to recurring scenarios
  • Collaborate with engineering and product leadership to prioritize reliability, performance, and operability work alongside feature delivery
  • Document standards, playbooks, and best practices so reliability improvements scale across teams
  • Mentor other SREs and software engineers in reliability-minded design, observability, incident response, and pragmatic use of SRE practices
  • Help build systems, automation, and team practices that reduce reliance on heroics and ad-hoc firefighting

What We Look For

  • 8+ years of experience operating complex, user-facing SaaS systems and working on production systems and reliability-focused initiatives
  • Proven experience leading multi-sprint, multi-engineer projects (for example reliability, performance, or infrastructure initiatives) to successful completion with clear business impact
  • Experience leading at least one org-wide or multi-team reliability or performance initiative from definition through rollout and follow-through on improvements
  • Thorough understanding of, and hands-on experience with, modern SRE practices, such as:
      • Defining and implementing SLIs/SLOs and error budgets
      • Reducing toil through automation
      • Safe deployment and rollout patterns
      • Structured post-incident reviews and continuous improvement
  • Strong software engineering skills: you’ve written and maintained production-quality code and can work comfortably in at least one modern language (for example Python or Node.js/TypeScript)
  • You regularly use LLMs and AI-assisted tooling in your workflow and know how to validate and improve what they generate
  • Deep expertise in at least one reliability-related domain, such as observability, incident management, performance engineering, or large-scale data/search platforms
  • Strong observability skills, including:
      • Designing metrics, logging, and tracing for multi-service systems
      • Building actionable dashboards and alerts with clear runbooks
      • Correlating metrics, logs, and traces to debug complex issues
  • Experience with tools s