Site Reliability Engineer, AI & Data Platforms (LLM & Kubernetes)
POWER COZMO
Job Description
As an AI Support Engineer at Power Cozmo (India & Jordan), based in Amman, Jordan, you will play a crucial role in maintaining and supporting Kubernetes clusters for our production AI and data systems. Your responsibilities will involve a range of tasks related to infrastructure support, local LLM platform maintenance, data scraping, system monitoring, incident support, and automation.
Key Responsibilities:
- **Kubernetes & Infrastructure Support**
- Deploy, manage, and support production-grade Kubernetes clusters
- Troubleshoot pod, node, networking, and storage issues in live environments
- Manage Helm charts, ConfigMaps, Secrets, and Kubernetes manifests
- Manage infrastructure and application deployments via Git repositories
- **Local LLM & AI Platform Support**
- Support local LLM deployments (Ollama, Mistral, LLaVA, Qwen, etc.)
- Troubleshoot inference performance, memory issues, and model loading failures
- Support AI services exposed via REST APIs or internal microservices
- **Data Scraping & Crawling**
- Support and maintain web scraping and crawling pipelines
- Debug data extraction failures, rate limits, and anti-bot challenges
- Ensure data quality, consistency, and pipeline reliability
- Assist in scheduling and monitoring crawlers (cron jobs)
- **System Monitoring & Incident Support**
- Perform root cause analysis for production incidents
- Monitor logs, metrics, and alerts using tools such as Prometheus and Grafana
- Maintain uptime and SLAs for AI and data platforms
- Participate in on-call or escalation support rotations
- **Automation**
- Maintain documentation for support procedures and known issues
- Collaborate with engineering teams for fixes and optimizations
Required Skills & Qualifications:
- 5+ years of experience as a Data / AI Support Engineer
- Hands-on experience with Docker and containerized workloads
- Experience supporting local LLMs or AI inference systems
- Practical knowledge of Linux system administration
- Experience with data scraping or web crawling systems
- Basic scripting skills (Python, Bash)
Preferred / Good to Have:
- Familiarity with Neo4j, Kafka, or data pipelines
- Experience with Ollama, Hugging Face models, or similar LLM runtimes
- Knowledge of cloud-native monitoring tools
- Understanding of networking concepts (DNS, ingress, load balancers)
In this role, you will have the opportunity to work on cutting-edge AI and LLM infrastructure, both in the cloud and on local AI servers, in a fast-paced startup or scale-up environment, offering significant career growth opportunities in the field of AI.