HPC Compute Platform Engineer
GTN Technical Staffing
Job Description
Location: Dallas, TX (Hybrid)
Type: Direct Hire
- Competitive base salary + performance bonus
- 100% company-paid benefits
Overview
We are seeking a Compute Platform Engineer to support the reliability, performance, and operational health of the large-scale, high-performance compute infrastructure behind critical research and production workloads.
This role is responsible for maintaining and troubleshooting CPU and GPU-based compute platforms, ensuring consistent performance at scale, and driving operational excellence across the environment. The position works closely with platform engineering, infrastructure, and operations teams, as well as hardware vendors, to support a stable and highly available compute ecosystem.
The ideal candidate brings strong hands-on experience with HPC or AI infrastructure, deep knowledge of server hardware, and a proactive approach to troubleshooting, automation, and continuous improvement.
Key Responsibilities
Compute Infrastructure Engineering
- Design, configure, and manage high-performance compute infrastructure composed of CPU and GPU nodes
- Support large-scale HPC and AI platforms, ensuring systems are stable, performant, and production-ready
- Perform diagnostics, tuning, and capacity planning to support efficient scale-out of compute environments
Hardware Reliability & Lifecycle Management
- Manage the full firmware and BIOS lifecycle across compute infrastructure, including baselines, validation, rollout, and compliance
- Troubleshoot complex hardware issues across CPU, GPU, DPU, NVSwitch, NICs, memory, PSU, and BMC components
- Drive root cause analysis and implement solutions to improve system reliability and reduce recovery time
- Analyze hardware lifecycle processes and recommend improvements for optimization and efficiency
Automation & Platform Operations
- Automate health checks, onboarding workflows, and operational processes to improve deployment efficiency
- Leverage Infrastructure-as-Code (IaC) methodologies to enable scalable and repeatable infrastructure management
- Recommend and implement tooling and process improvements to enhance platform operations
Vendor & Cross-Functional Collaboration
- Collaborate with hardware vendors to resolve firmware and system issues, providing detailed diagnostics, logs, and impact analysis
- Work closely with infrastructure, platform, and operations teams to align on system performance and reliability goals
- Support integration of hardware improvements across the broader environment
Monitoring, Performance & Security
- Monitor hardware performance and identify opportunities for optimization
- Implement best practices for platform security and system hardening
- Ensure adherence to operational standards and data center processes
Technical Leadership
- Act as a subject matter expert for compute infrastructure and hardware-related issues
- Mentor junior engineers and contribute to a culture of continuous improvement and technical excellence
Required Experience
- 3+ years of hands-on experience supporting large-scale compute platforms, HPC, or AI infrastructure
- Strong experience with HPE server platforms such as ProLiant and Apollo
- Experience working with NVIDIA GPUs, including A100, H100/H200, or similar
- Solid understanding of server architecture including UEFI/BIOS, PCIe devices, and out-of-band management systems (iLO, BMC)
- Proven ability to troubleshoot complex hardware issues and coordinate with vendors for resolution
- Experience with Linux in high-performance or latency-sensitive environments
- Familiarity with core networking concepts including DNS, DHCP, VLANs, switching, and routing
- Experience working within data center environments and operational processes
Technical Skills
- Experience with automation tools such as Ansible, Terraform, and CI/CD pipelines
- Exposure to Infrastructure-as-Code (IaC) practices
- Working knowledge of Kubernetes and/or OpenStack (preferred)
- Strong problem-solving and analytical skills with the ability to operate in complex environments
Preferred Experience
- Experience supporting AI platforms or next-generation GPU architectures
- Exposure to large-scale distributed compute environments
- Experience working in mission-critical or high-availability infrastructure environments