Job Description
Site Reliability Engineer (SRE)
Location: Vienna, VA (4 days onsite, 1 day remote)
Type: Full-time / Contract
Job Summary
We are looking for an experienced Site Reliability Engineer (SRE) to support and monitor critical production systems. The role focuses on 24x7 monitoring, reducing manual work through automation, managing Splunk, and supporting cloud-based Disaster Recovery and Business Continuity processes. You will work closely with Cloud, DevOps, and Application teams to ensure system reliability and availability.
Key Responsibilities
- Provide 24x7 production monitoring and support for critical systems.
- Meet SLAs and follow SRE best practices to reduce manual remediation (toil).
- Build automated remediation solutions to improve system stability.
- Administer and configure Splunk for monitoring and troubleshooting.
- Support incremental change rollouts, application monitoring, and automation tasks.
- Participate in Business Continuity, Disaster Recovery (DR), and Continuity of Operations (COOP) activities.
- Perform system failover/switchover testing (Cold/Warm/Hot).
- Ensure high availability through fault tolerance, redundancy, and design for five-nines (99.999%) uptime.
- Monitor and resolve system data synchronization issues.
Required Skills & Experience
- Bachelor’s degree in Computer Science or a related field.
- 6+ years of SRE experience in production environments.
- Strong experience with Splunk administration and configuration.
- Hands‑on experience with DR, COOP, and Business Continuity on cloud platforms.
- Good understanding of reliability engineering concepts (HA, redundancy, failover).
- Strong troubleshooting, problem‑solving, and communication skills.
- Ability to work in a collaborative team environment.
Seniority level
Mid‑Senior level
Employment type
Contract
Job function
Information Technology
Industries
IT Services and IT Consulting