Senior HPC Infrastructure Engineer (Compute System)

Role Summary

Firmus is seeking a highly skilled and driven Senior HPC Infrastructure Engineer to join our Software Defined Infrastructure (SDI) team. In this role, you will build high-performance, fault-tolerant, and reliable infrastructure to support bare-metal provisioning, performance benchmarking, and platform validation.

You will be instrumental in ensuring the stability, performance, and continuous improvement of our complex and mission-critical bare-metal HPC GPU clusters.

Key Responsibilities

Own the end-to-end lifecycle of AI compute systems, including GPU compute, NVSwitch, and platform firmware (BIOS, GPU, NIC, and storage devices).
Define, maintain, and enforce supported firmware and driver compatibility matrices across hardware generations, operating systems, kernels, and AI software stacks.
Lead firmware qualification and regression testing to ensure updates do not introduce performance degradation, instability, or compatibility issues.
Investigate and remediate performance regressions caused by firmware, driver, or system-level changes, working closely with networking, storage, and HPC engineers.
Integrate firmware and performance checks into SDI tooling, enabling automated validation during provisioning, upgrades, and cluster bring-ups.
Produce clear technical documentation, including firmware standards, validation reports, and benchmarking results, to support operational consistency and informed decision-making.
Collaborate with L2 SREs, site operations, and networking teams to ensure platform reliability, reproducibility, and performance.
Support hardware bring-up activities, including BIOS tuning, GPU topology verification, NUMA alignment, and PCIe/NVLink checks.
Contribute to continuous improvement in cluster validation, CI/CD automation, and provisioning and testing frameworks.
Contribute to the development of custom Kubernetes operators and intelligent orchestration frameworks that streamline the commissioning of large-scale GPU clusters for AI workloads.

Skills & Experience

Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
Experience with bare-metal cluster provisioning using tools such as Metal3, OpenStack Ironic, MaaS, xCAT, or similar.
Hands-on expertise with platform firmware and low-level system components, including BIOS, BMC, GPU firmware, NIC firmware, and storage devices.
Proven experience managing firmware and driver compatibility across operating systems, Linux kernels, and AI software stacks, with a disciplined approach to version control and validation.
Solid understanding of GPU architecture and interconnects, including PCIe, NVLink, and GPU-to-GPU communication patterns.
Demonstrated experience in performance benchmarking and validation using industry-standard and custom tools to measure GPU, compute, storage, and interconnect performance.
Strong Linux systems knowledge, including kernel behaviour, driver management, performance tuning, and troubleshooting at the OS and hardware boundary.
Experience diagnosing and resolving performance regressions related to firmware, drivers, or system-level changes in production or pre-production environments.
Strong automation mindset using tools such as Ansible, Helm, Terraform/OpenTofu, or equivalent.
Understanding of out-of-band management interfaces (BMC, IPMI, Redfish) and low-level system tuning.
Proficiency in one or more programming languages such as Go, Bash, Rust, or Python.
Excellent documentation skills with a high level of attention to detail.
Experience participating in an on-call rotation supporting production services.
Proactive self-starter with a drive for continuous technical improvement.

Key Competencies

Ability to understand AI compute platforms as end-to-end systems spanning hardware, firmware, operating systems, drivers, and workloads.
Ability to anticipate cross-layer impacts of changes and design solutions that optimise overall system performance and reliability.
Ability to proactively identify risks related to firmware upgrades and ensure compatibility through structured validation and rollback strategies.
Experience operating AI infrastructure at medium to large scale, with a focus on reliability, repeatability, and performance consistency.
Strong sense of ownership and accountability for system performance and reliability.
Comfortable operating in ambiguous, fast-evolving environments while driving continuous improvement.

Success Metrics

Reliable, automated firmware validation and upgrade systems and processes.
Performance validation and optimisation.
Improved operational efficiency.
High-quality documentation and effective knowledge transfer.

Location & Reporting

Sydney, NSW or Hobart/Launceston, TAS
Reporting to Senior Manager, Software Defined Infrastructure

Employment Basis

Full-time

Diversity

At Firmus, we are committed to building a diverse and inclusive workplace. We encourage applications from candidates of all backgrounds who are passionate about creating a more sustainable future through innovative engineering solutions.

Join us in our mission to revolutionise the AI industry through sustainable practices and cutting-edge engineering. Apply now to be part of shaping the future of sustainable AI infrastructure.

About Sustainable Metal Cloud

Our vision is to move cloud computing towards net zero, with solutions forged through advanced technology. We partner with NVIDIA to provide large-scale GPU AI infrastructure.

Why You'll Love Working Here

Our team shares a passion for possibility, knowing that our technology enables ideas across the world: ideas that can reshape the course of progress and break down traditional boundaries.