
AI Engineer, AI & Applications

Role Summary

The AI Engineer will establish Firmus AI Factory as the foundation for efficient, production-grade distributed training by delivering pre-built training recipes (TorchTitan, Megatron-LM, etc.), evaluation benchmarks, and model guidance. You'll work with customers and internal teams to optimize training efficiency, define baselines, and document best practices. Your templates and benchmarks will be the anchor point for our hyperscale customers' training workflows and our model arena differentiator.


Key Responsibilities

  • Build production-ready training recipes using TorchTitan and Megatron-LM: model configs, parallelism strategies (FSDP, tensor/pipeline parallelism), checkpointing patterns.
  • Document parameter tuning for different scales (e.g., "to train Llama 7B on 8xH100s, use this config and expect X throughput").
  • Create and validate multi-node NCCL communication patterns on AI Factory K8s/Slurm clusters.
  • Design and build benchmarking suites: accuracy, latency, throughput (tokens/sec), cost per token, energy efficiency, MFU.
  • Implement offline evaluation harnesses for standardized model comparison and leaderboard tracking.
  • Conduct fine-tuning experiments (LoRA, QLoRA) where they improve product outcomes (e.g., ops domain data), and document the gains.
  • Create training efficiency playbooks and publish benchmark results so customers can optimize workloads.
  • Partner with job scheduling and orchestration engineers on template integration, and with other AI and software engineers on model optimization trade-offs for inference and AI applications.
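To give a flavor of one metric named above, MFU (Model FLOPs Utilization) compares achieved training FLOPs against the hardware's peak. A minimal sketch, using the common ~6 FLOPs-per-parameter-per-token approximation for dense transformer training; the throughput and peak-FLOPs figures below are illustrative, not Firmus benchmarks:

```python
def mfu(model_params: float, tokens_per_sec: float,
        num_gpus: int, peak_flops_per_gpu: float) -> float:
    """Model FLOPs Utilization: achieved training FLOPs / peak hardware FLOPs.

    Uses the ~6 FLOPs per parameter per token approximation for dense
    transformer training (forward + backward pass combined).
    """
    achieved = 6.0 * model_params * tokens_per_sec
    peak = num_gpus * peak_flops_per_gpu
    return achieved / peak

# Example (illustrative numbers): a 7B-parameter model sustaining
# 24,000 tokens/sec across 8 GPUs, each with ~989 TFLOP/s BF16 peak.
print(round(mfu(7e9, 24_000, 8, 989e12), 3))  # → 0.127
```

Throughput in tokens/sec is easy to read off a dashboard; normalizing it to MFU is what makes results comparable across model sizes and hardware generations.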


Skills & Experience

  • 5–7 years of experience in distributed machine learning (PyTorch/JAX, FSDP, DeepSpeed, multi-node training at 10+ GPUs).
  • Expert-level understanding of GPU optimization: utilization, memory patterns, communication bottlenecks (NCCL collectives).
  • Hands-on distributed training at scale: debugged convergence issues, profiled bottlenecks, optimized throughput.
  • Strong benchmarking methodology: design controlled experiments, measure noise, and communicate results rigorously.
  • Familiarity with TorchTitan, Megatron-LM, or similar production training frameworks. 
  • Understanding of model parallelism strategies and their trade-offs (FSDP vs. tensor vs. pipeline parallelism, etc.).
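"Measure noise" in practice means repeating a benchmark and reporting spread, not a single number. A minimal sketch using only the standard library; the run values are illustrative, and the interval assumes roughly normal run-to-run noise:

```python
import statistics

def summarize_runs(throughputs: list[float]) -> tuple[float, tuple[float, float]]:
    """Summarize repeated benchmark runs: mean plus a rough 95% interval
    (mean ± 1.96 × standard error; assumes approximately normal noise)."""
    n = len(throughputs)
    mean = statistics.fmean(throughputs)
    stderr = statistics.stdev(throughputs) / n ** 0.5
    return mean, (mean - 1.96 * stderr, mean + 1.96 * stderr)

# Five illustrative repeats of the same training benchmark, in tokens/sec.
runs = [23_800, 24_100, 23_950, 24_200, 23_900]
mean, (lo, hi) = summarize_runs(runs)
print(f"{mean:.0f} tokens/sec (95% CI {lo:.0f}-{hi:.0f})")
```

Reporting the interval alongside the mean is what lets a reader judge whether a claimed efficiency gain exceeds the measurement noise.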


Key Competencies

  • Distributed Systems Mastery: can explain NCCL, collective communications, and sources of scaling inefficiency.
  • Benchmarking Rigor: doesn't just run benchmarks; validates assumptions, explains variance, communicates uncertainty.
  • Production Thinking: understands checkpointing, recovery, resource constraints, and cost optimization.
  • Mentorship: can guide engineers on training best practices and debugging distributed training issues.
  • Documentation: creates clear, actionable playbooks that customers can follow.


Success Metrics

  • Benchmark credibility & decision impact increases: benchmarks are trusted and used to drive model/hardware/product decisions.
  • Training efficiency leadership: sustained improvement in benchmarked training efficiency on representative workloads.
  • Shorter time-to-validate new models: model candidates can be evaluated quickly and consistently end-to-end.
  • Template effectiveness improves: recipes reduce misconfigurations and repeated setup failures; fewer training config escalations.
  • Competitive differentiation strengthens: model arena outputs influence customer adoption and internal roadmap priorities.


Location & Reporting

  • Singapore or Australia (Launceston, TAS or Sydney, NSW)
  • Reporting to Head of AI & Applications


Employment Basis

Full-time


Diversity

At Firmus, we are committed to building a diverse and inclusive workplace. We encourage applications from candidates of all backgrounds who are passionate about creating a more sustainable future through innovative engineering solutions. 

Join us in our mission to revolutionize the AI industry through sustainable practices and cutting-edge engineering. Apply now to be part of shaping the future of sustainable AI infrastructure. 

About Sustainable Metal Cloud

Our vision is to move cloud computing towards net zero, with solutions forged through advanced technology. We partner with NVIDIA to provide large-scale GPU AI infrastructure.

Why You'll Love Working Here

Our team shares a passion for possibility, knowing that our technology enables ideas across the world. Ideas that can reshape the course of progress and break down traditional boundaries.