REMOTE AI Support Operations Engineer
CyberCoders
Job Description
Title: AI Support Operations Engineer
Location: Fully REMOTE!
Salary: $150-200k/year + BONUS + RSUs
We're not following someone else's cloud blueprint - we're creating the next one. While legacy providers hand you a finished process, we're engineering the next generation of AI-optimized data center infrastructure from the ground up.
As our first internal Staff AI Support Operations Engineer, you'll be a foundational technical leader on a brand-new Ops team. This is a role for an architect-practitioner: the kind of engineer who can untangle a complex InfiniBand issue one hour and automate away the root cause the next. You won't just maintain systems - you'll build the operational standards and technical foundations that every future engineer will rely on.
Key Responsibilities
- Cluster Engineering & Operations: Collaborate with engineering teams to architect, deploy, and bring new AI compute clusters online while delivering expert-level support for existing high-density GPU environments.
- Infrastructure Source of Truth: Own NetBox and related internal systems, ensuring all infrastructure data is accurate, consistent, and reliably maintained.
- Automation & Tooling: Build and refine internal automation using Python, Ansible, and Terraform to eliminate manual workflows and modernize fragile legacy processes.
- Tier 3 Escalation Lead: Serve as the highest technical escalation point for customer and internal issues prior to involvement from Platform or Network/Undercloud teams.
- Documentation Excellence: Transform tribal knowledge into clear, durable SOPs and technical documentation that establish the operational "gold standard".
- Technical Leadership & Mentorship: Raise the technical bar for the team through code reviews, architectural guidance, and mentorship as the organization scales.
Qualifications
- Enterprise-Grade Server Proficiency: Advanced operational knowledge of HPE, Dell, and Supermicro platforms, including IPMI, BMC, and iDRAC workflows, and familiarity with Redfish-based management.
- Core Engineering Toolkit: Mastery of Python, Ansible, and Terraform as primary tools for automation, orchestration, and infrastructure lifecycle management.
- Linux Performance Engineering: Strong capability in diagnosing and tuning Linux systems, resolving performance bottlenecks, and optimizing workloads at the OS level.
- Advanced Incident Resolution: Demonstrated experience serving as the final technical escalation point for complex, high-impact infrastructure failures.
- Cloud-Native Operations: Proven production experience operating and troubleshooting Kubernetes environments.
Nice to have
- Next-Generation GPU Hardware: Familiarity with NVIDIA Blackwell (B200/B300) or Hopper (H100/H200) architectures.
- High-Performance Fabrics: Experience with InfiniBand or RoCE networking, and modern high-throughput storage platforms such as Weka or VAST Data.
- Bare-Metal Provisioning: Exposure to OpenStack or Canonical MAAS for automated provisioning of physical infrastructure.
Legacy is predictable. Safe. Slow. We're none of those things. We're building the Neo-Cloud at AI speed, and the rules aren't handed to you - you define them. If you're ready to trade routine for impact and build systems that actually move the company forward, let's talk.