HPC Platform Engineer
GenZ Staffing Services
Job Description
Scope of Work: HPC Cluster Deployment
Automate the deployment of HPC clusters with CI/CD pipelines built on GitHub and AWS Systems Manager.
Implement CI/CD pipelines to manage and deploy updates to the HPC cluster efficiently.
Set up and configure HPC clusters to meet specific requirements and workloads.
Manage and maintain HPC hardware components such as CPUs and GPUs, along with the necessary software.
Conduct regression testing to verify the functionality and performance of non-GXP HPC clusters.
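The deployment automation described above could be sketched as a pipeline step that triggers a cluster deployment script on tagged instances via AWS Systems Manager. This is a minimal, hypothetical sketch: the cluster tag, script path, and commit reference are illustrative assumptions, and a real pipeline would pass the payload to `boto3.client("ssm").send_command(**params)`.

```python
# Hypothetical sketch: build the payload for an AWS Systems Manager
# Run Command call that kicks off an HPC cluster deployment script.
# Cluster name, script path, and tag scheme are illustrative assumptions.

def build_ssm_deploy_params(cluster_name, script_path, commit_sha):
    """Return kwargs suitable for boto3's ssm.send_command()."""
    return {
        "Targets": [{"Key": "tag:Cluster", "Values": [cluster_name]}],
        "DocumentName": "AWS-RunShellScript",
        "Parameters": {
            "commands": [f"bash {script_path} --ref {commit_sha}"],
        },
        "Comment": f"Deploy {cluster_name} @ {commit_sha[:8]}",
    }

# In a GitHub-driven pipeline step this payload would be sent with
# boto3.client("ssm").send_command(**params); here we only construct it.
params = build_ssm_deploy_params("hpc-prod", "/opt/deploy/cluster.sh",
                                 "a1b2c3d4e5f6")
print(params["Comment"])
```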
Workload Scheduler Management:
Install and configure workload managers and schedulers like LSF, SLURM, and PBS Pro.
Manage the addition and removal of compute nodes and adjust the priority of master and slave nodes.
Develop and manage resource policies and rules to optimize cluster performance.
Configure and allocate resources such as CPU and memory, and profile applications for optimal performance.
Address and resolve issues related to schedulers, daemons, and license servers.
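Node addition/removal of the kind described above is typically done through the scheduler's control utility; in Slurm that is `scontrol`. A minimal sketch, assuming Slurm, of helpers that construct the drain/resume commands (a real tool would execute them with `subprocess.run` on a management host):

```python
# Hypothetical sketch: build Slurm scontrol argument lists for taking a
# compute node out of service (DRAIN) and returning it (RESUME).
# Node names and reasons are illustrative.

def drain_node_cmd(node, reason):
    return ["scontrol", "update", f"NodeName={node}",
            "State=DRAIN", f"Reason={reason}"]

def resume_node_cmd(node):
    return ["scontrol", "update", f"NodeName={node}", "State=RESUME"]

# A real script would pass these lists to subprocess.run(...).
print(" ".join(drain_node_cmd("cn042", "GPU_fault")))
```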
Network and High-Performance Connectivity Management:
Install and configure HPC interconnect networks.
Design and configure the network topology for HPC clusters.
Ensure the maintenance and monitoring of InfiniBand connectivity.
Resolve connectivity issues related to InfiniBand, RoCE, and Ethernet.
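InfiniBand monitoring like the above often starts from the output of a diagnostic such as `ibstat`. A hypothetical sketch that flags ports not in the Active state; the sample text and adapter names are illustrative, not real output from any specific fabric:

```python
# Hypothetical sketch: scan ibstat-style output and report channel
# adapters whose port state is not "Active". Sample text is illustrative.

SAMPLE_IBSTAT = """\
CA 'mlx5_0'
    Port 1:
        State: Active
        Rate: 200
CA 'mlx5_1'
    Port 1:
        State: Down
        Rate: 10
"""

def down_ports(ibstat_text):
    bad, current_ca = [], None
    for line in ibstat_text.splitlines():
        line = line.strip()
        if line.startswith("CA '"):
            current_ca = line.split("'")[1]
        elif line.startswith("State:") and "Active" not in line:
            bad.append(current_ca)
    return bad

print(down_ports(SAMPLE_IBSTAT))  # ['mlx5_1']
```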
Monitoring and Reports:
Produce daily health check reports for the HPC cluster.
Automate monitoring scripts to streamline the monitoring process.
Conduct periodic reviews of reports and audit trails.
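A daily health-check report of the kind listed above can be as simple as summarizing per-node scheduler states. A minimal sketch, assuming node states have already been collected (e.g. from `sinfo`); the node names and states are illustrative:

```python
# Hypothetical sketch: format a daily HPC health-check report from a
# mapping of node name -> scheduler state. Inputs are illustrative.
from collections import Counter

def health_report(node_states, day):
    counts = Counter(node_states.values())
    lines = [f"HPC daily health check - {day}"]
    for state in sorted(counts):
        lines.append(f"  {state}: {counts[state]} node(s)")
    return "\n".join(lines)

report = health_report({"cn001": "idle", "cn002": "alloc", "cn003": "down"},
                       "2024-01-15")
print(report)
```

In practice a cron job would collect the states, write the report to a shared location, and alert on any "down" count above zero.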
OS Administration and Management:
Install and configure operating systems for HPC clusters.
Address OS-related issues such as CPU, memory, and SWAP utilization, and perform application file system cleanup.
Ensure application service continuity by performing pre- and post-checks from both OS and application perspectives during planned and unplanned outages.
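The CPU/memory/SWAP checks mentioned above usually read `/proc/meminfo`. A hypothetical sketch that computes memory and swap utilization from meminfo-style text; the sample values are made up for illustration:

```python
# Hypothetical sketch: compute memory and swap utilization percentages
# from /proc/meminfo-style text, as an OS pre/post check might.
# Sample figures are illustrative.

SAMPLE_MEMINFO = """\
MemTotal:       16384000 kB
MemAvailable:    4096000 kB
SwapTotal:       8192000 kB
SwapFree:        6144000 kB
"""

def meminfo_pct(text):
    kv = {}
    for line in text.splitlines():
        key, rest = line.split(":")
        kv[key] = int(rest.split()[0])
    mem_used = 100 * (kv["MemTotal"] - kv["MemAvailable"]) / kv["MemTotal"]
    swap_used = 100 * (kv["SwapTotal"] - kv["SwapFree"]) / kv["SwapTotal"]
    return round(mem_used, 1), round(swap_used, 1)

print(meminfo_pct(SAMPLE_MEMINFO))  # (75.0, 25.0)
```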
Applications and Tools:
Install HPC libraries and tools such as MPI and compilers.
Install and configure HPC applications, both commercial off-the-shelf (COTS) and open source, and manage packages using Spack.
Apply patches and upgrades to HPC applications.
Resolve issues related to HPC applications.
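Package management with Spack, as described above, revolves around install specs of the form `pkg@version %compiler variants`. A minimal sketch of a spec builder; the package versions and compiler below are illustrative assumptions, not a recommended stack:

```python
# Hypothetical sketch: assemble Spack install spec strings for a batch
# of HPC packages. Versions, compilers, and variants are illustrative.

def spack_spec(pkg, version=None, compiler=None, variants=()):
    spec = pkg
    if version:
        spec += f"@{version}"
    if compiler:
        spec += f" %{compiler}"
    for v in variants:
        spec += f" {v}"
    return spec

# Each spec would then be handed to `spack install <spec>`.
specs = [spack_spec("openmpi", "4.1.6", "gcc@12.3.0", ["+cuda"]),
         spack_spec("hdf5", variants=["+mpi"])]
print(specs)
```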
HPC Storage Management:
Administer and configure HPC storage systems.
Oversee the administration of HPC file systems.
Monitor and troubleshoot HPC storage systems.
Manage backup and tape library systems.
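Storage monitoring of the kind listed above commonly reduces to checking usage against a threshold from `df`/`lfs df`-style data. A hypothetical sketch; the mount points and percentages are illustrative:

```python
# Hypothetical sketch: flag file systems at or over a usage threshold,
# as a storage monitoring script might. Mounts and values are illustrative.

def over_threshold(filesystems, pct_limit=85):
    """filesystems: list of (mount_point, used_pct) tuples."""
    return [mount for mount, used in filesystems if used >= pct_limit]

fs = [("/lustre/scratch", 92), ("/gpfs/projects", 61), ("/home", 88)]
print(over_threshold(fs))  # ['/lustre/scratch', '/home']
```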
Below are the key responsibilities and essential skills of the resources we will deploy.
Key Responsibilities
Cluster Management: Install, configure, and maintain compute nodes, GPUs (NVIDIA), high-speed storage (Lustre, GPFS), and interconnects (InfiniBand, RoCE).
Performance Tuning: Optimize scientific applications, kernels, and workflows for maximum throughput, scalability, and minimal queue times.
User Support: Act as a technical expert for researchers, debugging jobs, resolving complex issues, and providing training on tools and best practices.
Software Management: Manage workload managers (Slurm, LSF), schedulers (OpenPBS), software licensing (FlexLM), containers (Singularity), and compilers.
Infrastructure: Administer high-speed interconnects (InfiniBand), storage (Lustre, CEPH), and potentially cloud/hybrid solutions.
Implement and manage monitoring (Grafana, Prometheus) and orchestration tools (Slurm, Kubernetes).
Automation: Develop scripts (Python, Ansible) for provisioning, monitoring, and automating routine tasks.
Security & Policy: Implement and enforce security policies, manage user access, and oversee lifecycle management.
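The automation responsibility above (Python plus Ansible for provisioning) often starts with generating an inventory from the cluster's node list. A minimal sketch, assuming an INI-format Ansible inventory; the group and host names are illustrative:

```python
# Hypothetical sketch: generate an INI-style Ansible inventory from a
# mapping of group name -> hostnames. Groups and hosts are illustrative.

def make_inventory(groups):
    """groups: dict of group name -> list of hostnames."""
    lines = []
    for group, hosts in groups.items():
        lines.append(f"[{group}]")
        lines.extend(hosts)
        lines.append("")
    return "\n".join(lines)

inv = make_inventory({"compute": ["cn001", "cn002"], "gpu": ["gpu001"]})
print(inv)
```

A provisioning pipeline might regenerate this file whenever nodes are added or drained, then run playbooks against the affected groups.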
Essential Skills & Qualifications
Technical Expertise: Strong Linux, Python, scripting (Ansible, Terraform), HPC schedulers (Slurm), networking (InfiniBand), and GPU computing.
The team will have knowledge of Gilead systems and AWS CI/CD pipelines.
HPC Domain Knowledge: Experience with parallel file systems, workload management, and performance analysis tools.
Problem Solving: Excellent analytical and debugging skills for complex distributed systems.
Communication: Ability to explain complex technical issues to scientists and non-technical stakeholders.
Experience: Hands-on experience in data centers, managing large clusters, and supporting diverse scientific/AI workloads.
Job Type: Fixed term contract
Contract length: 12 months
Pay: $110,000.00-$125,000.00 per year
Work Location: Remote