Production Support/Site Reliability Engineer
Aarorn Technologies Inc.
Job Description
Role: Site Reliability Engineer - Production Support
Max Rate: $50/hr.
Position Overview
We are seeking a skilled and experienced Production Support Engineer through vendor staffing to support our digital applications. This role combines hands-on production support with Site Reliability Engineering (SRE) principles, focusing on toil elimination, infrastructure automation, and high availability of critical digital applications and backend systems.
Primary Responsibilities
1. Toil Removal & Infrastructure Maintenance (15%)
- Execute SSL/TLS certificate updates and renewals across production environments
- Perform Windows and Linux server patching and security updates
- Manage NPID password updates and credential rotation protocols
- Implement security vulnerability remediation in production systems
- Identify, document, and eliminate repetitive manual operational tasks
2. Infrastructure & Database Cluster Management (20%)
- Manage and support Elasticsearch cluster operations (deployment, scaling, monitoring, troubleshooting, performance tuning)
- Administer MongoDB clusters including replication, sharding, backup, recovery, and maintenance
- Operate and maintain Redis instances for caching and session management
- Monitor cluster health and perform capacity planning and optimization
- Execute failover and disaster recovery procedures
- Ensure data integrity and backup compliance
3. Automation & SRE Activities (15%)
- Develop, maintain, and enhance Ansible playbooks for infrastructure automation
- Build infrastructure-as-code solutions to reduce manual intervention
- Create and maintain comprehensive runbooks and operational playbooks
- Design monitoring, alerting, and observability solutions
- Implement automated remediation for common operational issues
- Quantify and prioritize toil reduction opportunities
4. Production Application Support (50%)
- Troubleshoot and resolve production incidents affecting digital applications
- Collaborate with application development and support teams on issue diagnosis
- Participate in incident response, root cause analysis, and post-mortems
- Monitor and respond to application performance degradation
---
Technical Requirements
Required Expertise (Must-Have)
- Ansible: 2+ years hands-on experience writing playbooks, roles, and automation workflows
- Elasticsearch: 2+ years managing and troubleshooting Elasticsearch clusters in production
- MongoDB: 2+ years with replica sets, sharding, backup/recovery, and performance tuning
- Redis: Proficiency in deployment, configuration, and operational support
- OpenShift: Experience deploying and managing containerized applications on OpenShift
- Azure: Knowledge of Azure cloud services, resource management, and deployments
- Linux Administration: 3+ years with RHEL, CentOS, or Ubuntu in production environments
- Windows Server Administration: Experience with patching, certificate management, and maintenance
- Shell Scripting: Bash scripting for automation and operational tasks
- Incident Management: Experience responding to and resolving critical production incidents
Preferred Skills
- Kubernetes or container orchestration platforms
- Python or Go scripting for automation
- CI/CD pipeline experience (Jenkins, GitLab CI, Azure DevOps)
- Monitoring and observability tools (Prometheus, Grafana, ELK Stack, Datadog)
- Infrastructure-as-Code tools (Terraform, CloudFormation)
- Security best practices and vulnerability management
- Relevant certifications (AZ-900, CKA, Elasticsearch, etc.)
---
Required Qualifications
- Minimum 5 years of production infrastructure support or SRE experience
- Minimum 3 years with at least 2 of the core technologies (Elasticsearch, MongoDB, Ansible, OpenShift)
- Experience working in a regulated financial services environment (preferred)
- Ability to work independently and in teams
- Strong troubleshooting and analytical capabilities
- Excellent documentation and communication skills
- Must be available for on-call support rotation (with reasonable notice)
---
Operational Expectations
- On-Call Rotation: Participates in the production support on-call schedule
- Incident Response: Available for critical incident resolution outside standard business hours as required
- Availability: Core business hours, plus flexibility for critical production issues
- Response Time: First response to critical incidents within 30 minutes
- Documentation: Maintains detailed runbooks, playbooks, and knowledge base articles
- Collaboration: Regular communication with infrastructure, development, and operations teams