Job Description
At BuildOps, we’re building a software platform that empowers today’s commercial contractors. From service management to project execution, we’re reimagining how our customers operate. Our team thrives on ambition, innovation, and collaboration – qualities we look for in every new hire.
You will join our cloud infrastructure and reliability engineering team as a Site Reliability Engineer (SRE). Your primary responsibility will be to improve and protect the reliability, performance, and operability of our production systems while helping evolve our AWS-based infrastructure. We’re looking for someone with a strong SRE mindset, solid software engineering fundamentals, and deep observability expertise who can work effectively in a distributed team environment.
Reporting to the DevOps and SRE Manager, this is a hands-on, staff-level role where you will influence reliability strategy, build tooling and automation, and contribute directly to day-to-day operations in a fast-moving, industry-defining company.
What You’ll Do
- Own one or more reliability domains end-to-end (for example observability, incident management workflows, performance of key surfaces, or core platform readiness), including strategy, roadmap, and execution
- Drive and refine modern SRE practices across services, including SLIs/SLOs, error budgets, and reliability reviews
- Lead multi-sprint, multi-engineer reliability or performance initiatives, coordinating work across teams and driving them to successful completion
- Design and maintain end-to-end observability (metrics, logs, traces, dashboards, and alerts) so teams can quickly detect, debug, and prevent issues
- Act as a subject-matter expert in at least one reliability area (for example observability, incident management, performance engineering, or search/data platforms), helping other teams make good design and operational decisions
- Partner with product and engineering teams to design reliable services—reviewing architectures, failure modes, rollout strategies, and capacity/latency considerations—and influence system design toward reliability and performance goals
- Help evolve and operate our AWS infrastructure (networking, compute, data stores) in collaboration with infrastructure experts, working within Infrastructure as Code workflows
- Contribute code to services, tooling, and automation (for example reliability libraries, deployment and incident tooling, health checks), and use LLMs/AI tools to accelerate high-quality delivery
- Define, implement, and iterate on SLIs, SLOs, and error budgets with service owners, and use them to guide reliability work and release decisions
- Participate in production on-call rotations and incident response for high-severity issues, including learning-focused post-incident reviews and follow-through on action items
- Develop runbooks, safeguards, and automation that reduce manual work, improve time-to-diagnosis, and standardize responses to recurring scenarios
- Collaborate with engineering and product leadership to prioritize reliability, performance, and operability work alongside feature delivery
- Document standards, playbooks, and best practices so reliability improvements scale across teams
- Mentor other SREs and software engineers in reliability-minded design, observability, incident response, and pragmatic use of SRE practices
- Help build systems, automation, and team practices that reduce reliance on heroics and ad-hoc firefighting
What We Look For
- 8+ years of experience operating complex, user-facing SaaS systems and working on production systems and reliability-focused initiatives
- Proven experience leading multi-sprint, multi-engineer projects (for example reliability, performance, or infrastructure initiatives) to successful completion with clear business impact
- Experience leading at least one org-wide or multi-team reliability or performance initiative from definition through rollout and follow-through on improvements
- Thorough understanding of, and hands-on experience with, modern SRE practices, such as:
  - Defining and implementing SLIs/SLOs and error budgets
  - Reducing toil through automation
  - Safe deployment and rollout patterns
  - Structured post-incident reviews and continuous improvement
- Strong software engineering skills: you’ve written and maintained production-quality code and can work comfortably in at least one modern language (for example Python or Node.js/TypeScript)
- You regularly use LLMs and AI-assisted tooling in your workflow and know how to validate and improve what they generate
- Deep expertise in at least one reliability-related domain, such as observability, incident management, performance engineering, or large-scale data/search platforms
- Strong observability skills, including:
  - Designing metrics, logging, and tracing for multi-service systems
  - Building actionable dashboards and alerts with clear runbooks
  - Correlating metrics, logs, and traces to debug complex issues
  - Experience with tools s