
HPC Compute Platform Engineer

GTN Technical Staffing
Full Time | Mid | Hybrid
Dallas, Texas, US | Posted April 8, 2026


Job Description

HPC Compute Platform Engineer

Location: Dallas, TX (Hybrid)

Type: Direct Hire

  • Competitive base salary + performance bonus
  • 100% company-paid benefits

Overview

We are seeking a Compute Platform Engineer to support the reliability, performance, and operational health of large-scale, high-performance compute infrastructure that runs critical research and production workloads.

This role is responsible for maintaining and troubleshooting CPU- and GPU-based compute platforms, ensuring consistent performance at scale, and driving operational excellence across the environment. The position works closely with platform engineering, infrastructure, and operations teams, as well as hardware vendors, to support a stable and highly available compute ecosystem.

The ideal candidate brings strong hands-on experience with HPC or AI infrastructure, deep knowledge of server hardware, and a proactive approach to troubleshooting, automation, and continuous improvement.

Key Responsibilities

Compute Infrastructure Engineering

  • Design, configure, and manage high-performance compute infrastructure composed of CPU and GPU nodes
  • Support large-scale HPC and AI platforms, ensuring systems are stable, performant, and production-ready
  • Perform diagnostics, tuning, and capacity planning to support efficient scale-out of compute environments

Hardware Reliability & Lifecycle Management

  • Manage the full firmware and BIOS lifecycle across compute infrastructure, including baselines, validation, rollout, and compliance
  • Troubleshoot complex hardware issues across CPU, GPU, DPU, NVSwitch, NICs, memory, PSU, and BMC components
  • Drive root cause analysis and implement solutions to improve system reliability and reduce recovery time
  • Analyze hardware lifecycle processes and recommend improvements that increase efficiency

Automation & Platform Operations

  • Automate health checks, onboarding workflows, and operational processes to improve deployment efficiency
  • Leverage Infrastructure-as-Code (IaC) methodologies to enable scalable and repeatable infrastructure management
  • Recommend and implement tooling and process improvements to enhance platform operations

Vendor & Cross-Functional Collaboration

  • Collaborate with hardware vendors to resolve firmware and system issues, providing detailed diagnostics, logs, and impact analysis
  • Work closely with infrastructure, platform, and operations teams to align on system performance and reliability goals
  • Support integration of hardware improvements across the broader environment

Monitoring, Performance & Security

  • Monitor hardware performance and identify opportunities for optimization
  • Implement best practices for platform security and system hardening
  • Ensure adherence to operational standards and data center processes

Technical Leadership

  • Act as a subject matter expert for compute infrastructure and hardware-related issues
  • Mentor junior engineers and contribute to a culture of continuous improvement and technical excellence

Required Experience

  • 3+ years of hands-on experience supporting large-scale compute platforms, HPC, or AI infrastructure
  • Strong experience with HPE server platforms such as ProLiant and Apollo
  • Experience working with NVIDIA GPUs, including A100, H100/H200, or similar
  • Solid understanding of server architecture including UEFI/BIOS, PCIe devices, and out-of-band management systems (iLO, BMC)
  • Proven ability to troubleshoot complex hardware issues and coordinate with vendors for resolution
  • Experience with Linux in high-performance or latency-sensitive environments
  • Familiarity with core networking concepts including DNS, DHCP, VLANs, switching, and routing
  • Experience working within data center environments and operational processes

Technical Skills

  • Experience with automation tools such as Ansible, Terraform, and CI/CD pipelines
  • Exposure to Infrastructure-as-Code (IaC) practices
  • Working knowledge of Kubernetes and/or OpenStack (preferred)
  • Strong problem-solving and analytical skills with the ability to operate in complex environments

Preferred Experience

  • Experience supporting AI platforms or next-generation GPU architectures
  • Exposure to large-scale distributed compute environments
  • Experience working in mission-critical or high-availability infrastructure environments
