Senior Site Reliability Engineer
Devopie Inc.
Job Description
💡 What You’ll Do
You’ll operate at the intersection of software engineering and systems engineering, building resilient systems that scale, self-heal, and empower developers to ship safely.
🔎 Reliability Engineering
- Define and manage SLIs, SLOs, and error budgets
- Reduce MTTD, MTTA, and MTTR through structured incident response
- Conduct blameless postmortems and drive preventative improvements
- Champion reliability in architectural reviews and production readiness
📊 Observability & Monitoring
- Design actionable, symptom-based alerts (not noise)
- Build dashboards and tracing systems using tools like CloudWatch, Prometheus, Grafana, New Relic, X-Ray, ADOT
- Implement synthetic monitoring to simulate real user journeys (URLs, clickpaths, APIs)
- Ensure full observability coverage across critical paths
☁️ Cloud & Infrastructure
- Operate and optimize AWS environments (EC2, EKS/ECS, Lambda, VPC, RDS, IAM, S3, ALB/NLB, CloudTrail)
- Build resilient, multi-AZ and regionally replicated systems
- Implement autoscaling and fault-tolerant architecture
- Leverage Infrastructure as Code (Terraform, CDK, CloudFormation)
🤖 Automation & Toil Reduction
- Eliminate manual processes through automation
- Build self-healing infrastructure
- Improve CI/CD pipelines with safe deployment strategies (canary releases, feature flags)
- Write production-quality code (not just scripts) in Python, Go, Ruby, Bash, or Java
📈 Performance & Capacity Planning
- Analyze system metrics and traffic patterns
- Conduct load testing, chaos testing, and capacity modeling
- Identify bottlenecks and proactively optimize systems
🤝 Cross-Functional Collaboration
You’ll work closely with:
- Engineering & Platform teams on scalable system design
- Security teams on IAM, KMS, GuardDuty, secrets management
- Product leaders to align reliability with roadmap priorities
- Cloud vendors and SaaS providers during critical incidents
🧠 What You Bring
Must-Have Experience
- Bachelor’s degree in Computer Science, Software Engineering, or related field
- Strong Linux/Unix systems knowledge
- Deep AWS experience
- Hands-on experience with Kubernetes (EKS), ECS, Docker, and container orchestration
- Infrastructure as Code (Terraform, CDK, CloudFormation)
- Production on-call and incident management experience
- Strong understanding of MTTx metrics (MTTD, MTTR, MTBF, etc.)
- Experience with MongoDB, PostgreSQL, Redis, RabbitMQ
- Experience with observability and monitoring platforms
- CI/CD pipeline experience (GitHub, Kubernetes, etc.)
Nice-to-Have
- Performance engineering and chaos testing
- Experience in fintech or regulated environments
- Knowledge of distributed storage systems (NFS, HDFS, Ceph, S3)
- Familiarity with dynamic resource frameworks (Kubernetes, Mesos, YARN)