Site Reliability Engineer - Transport Systems

LEIDOS

Full Time, Mid-Level

Job Description

The NMCI Service Management Integration and Transport (SMIT) group at Leidos is seeking a dedicated Site Reliability Engineer (SRE) to enhance the reliability, performance, and scalability of complex distributed systems. As part of the SMIT Contract, the Leidos team plays a critical role in maintaining the Navy-Marine Corps Intranet's core infrastructure, including cybersecurity services, network operations, network engineering, service desk support, and data transport.

In this role, you will develop and execute tests to assess system resilience, performance under load, and failure scenarios. Collaborating with fellow SREs and development teams, you will create automated testing frameworks that simulate real-world conditions and validate system behavior under various scenarios, ensuring our services remain robust and meet established service level objectives (SLOs). Your contributions will be vital in developing resilient and scalable services that operate reliably in production environments.

Your key responsibilities will include maintaining complex computer systems by writing automation scripts for software releases and system monitoring, and by detecting and resolving issues before users are affected. Your expertise will be essential in enhancing site performance and overall system reliability.

The SRE will support and optimize software development and deployment processes, implement infrastructure as code, and elevate the overall maturity of the Site Reliability Engineering program.

Primary Responsibilities

Collaborate with development and operations teams to ensure swift and reliable software deployments, actively monitoring systems and enhancing platform reliability. Proactively document and rectify system bugs.
Utilize AI coding tools and scripts to automate, scale, test, and secure cloud infrastructure and pipelines.
Enhance system performance monitoring using Splunk or other dashboard tools.
Identify performance bottlenecks and optimize cloud infrastructure accordingly.
Advise on improving engineering build, maintenance, automation, and reliability across the platform utilizing SRE/DevOps tools and Infrastructure-as-Code.
Create and code high-quality automation workflows for pipeline support aligned with business and technology strategies.
Design and execute test strategies replicating real-world failure scenarios such as network disruptions and system overloads.
Develop and run performance tests to assess system behavior under varying loads and traffic conditions, identifying bottlenecks and areas for improvement.
Build automated testing suites for infrastructure and application components, integrating into the CI/CD pipeline to validate system reliability with each release.
Establish automated systems for continuous performance, stress, and load testing.
Work closely with SREs, developers, and operations teams to define reliability goals and develop effective testing strategies.
Ensure thorough performance and reliability testing of new services and features prior to production deployment.
Verify that monitoring, logging, and alerting systems operate correctly during failure conditions.
Measure and track Service Level Indicators (SLIs) and Service Level Objectives (SLOs) through automated frameworks.
Independently resolve conflicts related to timelines, budgets, and scope, escalating significant issues to senior management as necessary.

Basic Qualifications

Bachelor's degree preferred; 4-8 years of relevant experience may be considered in lieu of a degree.
Active DoD Secret security clearance is required and must be maintained.
DoD 8570.01 IAT Level II Certification is required prior to onboarding and must be maintained for the duration of the SMIT Contract.
Ability to support operations in classified environments and access SIPRNet from an NMCI location on short notice (local travel required).
5+ years of experience with Cisco routers, switches, and network appliances.
5+ years of experience with routing protocols (e.g., OSPF, EIGRP, BGP).
5+ years of experience with L2 switching technologies (e.g., VLANs, spanning tree, VTP).
5+ years of troubleshooting experience with complex routing and switching issues.
Exposure to multiple vendor routing, switching, and wireless product lines.
Strong knowledge of TCP/IP network addressing and subnetting.
Ability to support network configuration and asset management, including managing configuration drift and keeping documentation accurate for the current and future state.
Ability to work both independently and in a team to resolve technical issues in a dynamic work environment.
Experience with scripting languages (e.g., Bash, Python) for automation preferred.
Familiarity with CI/CD toolsets (e.g., Jenkins, GitLab).
Experience with containerization tools like Docker and orchestration technologies such as Kubernetes.
Strong command of Linux/Unix environments.
Experience in application administration, configuration, and integration.
Familiarity with agile development methodologies and best practices in SRE/DevSecOps.
Proficient in working with distributed teams, displaying a collaborative and innovative approach.
Experience with Atlassian products (Jira, Confluence, Bitbucket, etc.) and creating custom configurations within Jira or Azure DevOps.
Experience maintaining the SRE platform via Ansible playbooks.
Proficient in automating tasks using scripting languages like PowerShell or Python.
Experience integrating and maintaining third-party CI/CD tools like Jenkins and GitLab.
Hands-on experience with PaaS using Red Hat OpenShift, Kubernetes, and Docker containers.
Experience deploying commercial cloud infrastructure in environments like AWS and Azure.
Familiarity with automated provisioning tools such as Terraform, CloudFormation, Chef, Puppet, or Ansible.
Understanding of the Risk Management Framework (RMF) and DISA STIGs.

Preferred Qualifications

Previous support experience within the NGEN-NMCI program.
Expertise with Infrastructure as Code (IaC) tools like Terraform, Ansible, or CloudFormation for automating test environments.

Your mission to transform the status quo starts here at Leidos. We’re seeking innovators eager to challenge expectations and achieve excellence in reliability engineering.