Site Reliability Engineer – GenAI Platform
Astra North Infoteck Inc.
Job Description
Experience: 8+ years as a Site Reliability Engineer or in a similar role, with hands-on experience supporting IaaS platforms and with networking and systems engineering knowledge.
Roles and Responsibilities:
- Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving)
- Design and build automation for core platform capabilities, reducing manual toil
- Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes / container orchestration, etc.
- Establish, monitor, and enforce SLOs/SLIs/SLAs, error budgets, alerting, and dashboards
- Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation
- Perform capacity planning, scaling strategies, workload scheduling, and resource forecasting
- Optimize cost vs. performance tradeoffs in large-scale compute environments
- Harden systems for security, compliance, auditability, and data governance
- Collaborate across teams (cloud engineers, data engineers, infrastructure, security) to ensure safe deployment, rollout, rollback, and integration of new systems
- Define disaster recovery (DR) strategies, backup/restore practices, and fault tolerance mechanisms
- Maintain runbooks, operational playbooks, documentation, and training materials
- Participate in on-call rotations and respond to production incidents 24/7 as needed
- Continuously evaluate and integrate new tools, frameworks, or technologies to enhance platform reliability
Skills
- Production experience in SRE / infrastructure / operations for large-scale systems
- Strong programming/scripting skills (Python, Go, Java, or equivalent)
- Deep experience with containerization (Docker), orchestration (Kubernetes, etc.)
- Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)
- Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures
- Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)
- Networking & systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)
- Solid experience in capacity planning, performance tuning, scaling, and incident response
- Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements
- Experience in regulated environments (financial services, compliance, audit, security) is a strong plus
- Excellent communication, documentation, and cross-team collaboration skills
- Proven track record of reducing operational toil via automation