Senior HPC Infrastructure Engineer

Role Summary

Firmus is seeking a highly skilled and driven Kubernetes HPC Engineer to join our Software Defined Infrastructure team. In this role, you will build high-performance, fault-tolerant, and reliable infrastructure to support bare-metal provisioning, performance benchmarking, and platform validation.

You will be instrumental in ensuring the stability, performance, and continuous improvement of our complex and mission-critical bare-metal HPC GPU clusters.

Key Responsibilities

Design and implement bare-metal provisioning workflows using Ironic and Kubernetes CRDs.
Deploy and manage GPU-enabled AI compute nodes with RDMA, InfiniBand, and RoCE networking.
Optimise Kubernetes and Slurm platforms for multi-node AI training performance, including NCCL, UCX, GPUDirect, and fabric tuning.
Implement GPU scheduling, isolation, and resource management models using Kubernetes primitives.
Design, deploy, and fine-tune Slurm GPU clusters with topology-aware configurations.
Develop and execute performance benchmarking workloads, including MLPerf, NCCL tests, microbenchmarks, and throughput/latency validation.
Establish observability across GPU, InfiniBand fabric, storage, and provisioning components.
Document architecture designs, operational procedures, and performance results.
Collaborate with L2 SREs, site operations, and networking teams to ensure platform reliability, reproducibility, and performance.
Support hardware bring-up activities, including BIOS tuning, GPU topology verification, NUMA alignment, and PCIe/NVLink checks.
Contribute to continuous improvement in cluster validation, CI/CD automation, and provisioning and testing frameworks.
Contribute to the development of custom Kubernetes operators and intelligent orchestration frameworks that optimise AI workload performance for large-scale GPU cluster commissioning.

Skills & Experience

Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
Experience with bare-metal cluster provisioning using tools such as Metal3, OpenStack Ironic, MaaS, xCAT, or similar.
Deep knowledge of Kubernetes internals, including CRDs, controllers, operators, and cluster lifecycle management.
Strong understanding of Slurm configuration and of compiling AI and HPC applications.
Strong understanding of GPU systems (NVIDIA H100/H200 SXM platforms), CUDA/NCCL, and GPU topology (NVLink, NVSwitch, PCIe).
Familiarity with container runtimes for compute workloads, including Docker, Enroot, Singularity, and Podman.
Experience with benchmarking and performance validation for AI, HPC, or distributed training workloads.
Practical Linux systems engineering experience, including kernel, cgroups, system services, networking, and drivers.
Strong automation mindset using tools such as Ansible, Helm, Terraform/OpenTofu, or equivalent.
Understanding of firmware, BIOS, BMC/IPMI/Redfish, and low-level system tuning.
Proficiency in one or more programming languages such as Go, Bash, Rust, or Python.
Excellent documentation skills with strong attention to detail.
Experience participating in an on-call rotation supporting production services.
Proactive self-starter with a drive for continuous technical improvement.

Key Competencies

Systems Architecture: Ability to design and integrate bare-metal, GPU, RDMA, and Kubernetes/Slurm platforms.
Infrastructure Automation: Skilled in automated provisioning and lifecycle management of hardware and clusters.
GPU and HPC Performance: Understanding of GPU systems, RDMA fabrics, and distributed AI workload performance.
Technical Communication: Ability to communicate technical concepts effectively across diverse engineering and operations teams.
Continuous Improvement: Demonstrates curiosity, proactive learning, and innovation in AI and HPC infrastructure.

Success Metrics

Reliable provisioning of Kubernetes and Slurm AI clusters.
Performance validation and optimisation.
Improved operational efficiency.
High-quality documentation and effective knowledge transfer.

Location & Reporting

Australia (Sydney, NSW or Launceston, TAS)
Reporting to Senior Manager, Software Defined Infrastructure

Employment Basis

Full-time

Diversity

At Firmus, we are committed to building a diverse and inclusive workplace. We encourage applications from candidates of all backgrounds who are passionate about creating a more sustainable future through innovative engineering solutions.

Join us in our mission to revolutionise the AI industry through sustainable practices and cutting-edge engineering. Apply now to help shape the future of sustainable AI infrastructure.

About Sustainable Metal Cloud

Our vision is to move cloud computing towards net zero, with solutions forged through advanced technology. We partner with NVIDIA to provide large-scale GPU AI infrastructure.

Why You'll Love Working Here

Our team shares a passion for possibility, knowing that our technology enables ideas across the world: ideas that can reshape the course of progress and break down traditional boundaries.