Production Support/Site Reliability Engineer

Aarorn Technologies Inc.

Full Time, Mid-level

Toronto, Ontario, CA

Posted 7 weeks ago

Job Description

Role: Site Reliability Engineer - Production Support

Maximum rate: $50/hr.

Position Overview

We are seeking a skilled and experienced Production Support Engineer, engaged through vendor staffing, to support our digital applications. This role combines hands-on production support with Site Reliability Engineering (SRE) principles, focusing on toil elimination, infrastructure automation, and high availability of critical digital applications and backend systems.

Primary Responsibilities

1. Toil Removal & Infrastructure Maintenance (15%)

Execute SSL/TLS certificate updates and renewals across production environments
Perform Windows and Linux server patching and security updates
Manage NPID password updates and credential rotation protocols
Implement security vulnerability remediation in production systems
Identify, document, and eliminate repetitive manual operational tasks

2. Infrastructure & Database Cluster Management (20%)

Manage and support Elasticsearch cluster operations (deployment, scaling, monitoring, troubleshooting, performance tuning)
Administer MongoDB clusters including replication, sharding, backup, recovery, and maintenance
Operate and maintain Redis instances for caching and session management
Monitor cluster health and perform capacity planning and optimization
Execute failover and disaster recovery procedures
Ensure data integrity and backup compliance

3. Automation & SRE Activities (15%)

Develop, maintain, and enhance Ansible playbooks for infrastructure automation
Build infrastructure-as-code solutions to reduce manual intervention
Create and maintain comprehensive runbooks and operational playbooks
Design monitoring, alerting, and observability solutions
Implement automated remediation for common operational issues
Quantify and prioritize toil reduction opportunities

4. Production Application Support (50%)

Troubleshoot and resolve production incidents affecting digital applications
Collaborate with application development and support teams on issue diagnosis
Participate in incident response, root cause analysis, and post-mortems
Monitor and respond to application performance degradation

---

Technical Requirements

Required Expertise (Must-Have)

Ansible: 2+ years hands-on experience writing playbooks, roles, and automation workflows
Elasticsearch: 2+ years managing and troubleshooting Elasticsearch clusters in production
MongoDB: 2+ years with replica sets, sharding, backup/recovery, and performance tuning
Redis: Proficiency in deployment, configuration, and operational support
OpenShift: Experience deploying and managing containerized applications on OpenShift
Azure: Knowledge of Azure cloud services, resource management, and deployments
Linux Administration: 3+ years with RHEL, CentOS, or Ubuntu in production environments
Windows Server Administration: Experience with patching, certificate management, and maintenance
Shell Scripting: Bash scripting for automation and operational tasks
Incident Management: Experience responding to and resolving critical production incidents

Preferred Skills

Kubernetes or container orchestration platforms
Python or Go scripting for automation
CI/CD pipeline experience (Jenkins, GitLab CI, Azure DevOps)
Monitoring and observability tools (Prometheus, Grafana, ELK Stack, Datadog)
Infrastructure-as-Code tools (Terraform, CloudFormation)
Security best practices and vulnerability management
Relevant certifications (AZ-900, CKA, Elasticsearch, etc.)

---

Required Qualifications

Minimum 5 years of production infrastructure support or SRE experience
Minimum 3 years with at least two of the core technologies (Elasticsearch, MongoDB, Ansible, OpenShift)
Experience working in a regulated financial services environment (preferred)
Ability to work independently and in teams
Strong troubleshooting and analytical capabilities
Excellent documentation and communication skills
Must be available for on-call support rotation (with reasonable notice)

---

Operational Expectations

On-Call Rotation: Participates in the production support on-call schedule
Incident Response: Available for critical incident resolution outside standard business hours as required
Availability: Core business hours + flexibility for critical production issues
Response Time: First response to critical incidents within 30 minutes
Documentation: Maintains detailed runbooks, playbooks, and knowledge base articles
Collaboration: Regular communication with infrastructure, development, and operations teams