
SRE (Site Reliability Engineer) +AI
Tekgence Inc
Job Description
Role: SRE (Site Reliability Engineer) +AI
Hybrid: 3 days in office; face-to-face interview
Location: Montreal
Experience: 8+ years as a Site Reliability Engineer or in a similar role, with hands-on experience supporting IaaS platforms and knowledge of networking and systems engineering.
Roles and Responsibilities:
- Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving)
- Design and build automation for core platform capabilities, reducing manual toil
- Establish, monitor, and enforce SLOs/SLIs/SLAs, error budgets, alerting, and dashboards
Skills
- Production experience in SRE / Infrastructure / ops for large-scale systems
- Strong programming/scripting skills (Python, Go, Java, or equivalent)
- Deep experience with containerization (Docker), orchestration (Kubernetes, etc.)
- Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)
- Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)
- Networking & systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)
- Solid experience in capacity planning, performance tuning, scaling, and incident response