Senior Platform Engineer
ROLE
Firmus Technologies is seeking a Senior Platform Engineer to join our Engineering and Technology team. You will drive the design and implementation of our MLOps capability. You will also collaborate with other engineers and make technical decisions on scaling the Firmus AI factory's platform engineering capabilities to planet scale, spanning IaC, container orchestration, observability, self-service portals, and platform security. This role is ideal for a self-starter with a passion for building things from first principles. You naturally break down complex problems into their fundamental truths to uncover novel and elegant solutions rather than relying on conventional patterns.
KEY RESPONSIBILITIES
- Build MLOps capabilities from the ground up, enabling reproducible, scalable, and secure ML workflows across internal and customer-facing environments.
- Continuously improve our DevOps platform to ensure reliability, scalability, security, and seamless integration with CI/CD pipelines and infrastructure services.
- Design, implement, operate and secure Kubernetes-based production infrastructure for high reliability, performance and security, including clusters supporting NVIDIA GB300 NVL72 systems with NVIDIA Quantum-X800 InfiniBand or Spectrum-X Ethernet.
- Develop world-class observability platforms for internal and external customers to achieve ClusterMAX Platinum tier recognition from SemiAnalysis.
- Integrate Firmus central services with NVIDIA’s software stack, including Mission Control, NetQ, UFM, and NMX.
- Lead the enhancement and evangelism of internal platform products that provide cohesive, composable, secure-by-default, and low-friction self-service experiences that accelerate time to market and reduce engineers' cognitive load.
- Drive incident response efforts, participate actively in the on-call rotation, and lead detailed Root Cause Analysis (RCA) to continuously improve system reliability, operational maturity, and incident handling processes.
SKILLS AND EXPERIENCE
- Bachelor's degree in computer science or a related technical field.
- 7+ years of experience as a Platform Engineer, Site Reliability Engineer, DevOps Engineer, MLOps Engineer, or Observability Engineer.
- Demonstrated strong proficiency in the following areas:
  - Infrastructure-as-Code, configuration management and CI/CD (e.g., Terraform, Ansible, GitHub Actions, Jenkins, ArgoCD).
  - Containerization technologies (e.g., Docker), Kubernetes networking and cluster management, including upgrades and troubleshooting.
  - Observability stack design and scaling (e.g., Loki, Grafana, Tempo, Prometheus, Thanos, ClickHouse).
  - Telemetry solutions using various technologies (e.g., Redfish, gNMI, SNMP, eBPF, streaming analytics).
  - Unified telemetry collection with OpenTelemetry.
  - Compliance automation (e.g., OPA, Kyverno).
- Competent in scripting and programming (e.g., Bash, Python, Go).
- Systems knowledge of Linux internals, networking stacks, and distributed storage.
- Clear and effective English communication, written and spoken.
- Bonus: Experience in high-growth startups or regulated industries with robust security and data privacy requirements, including SOC 2 Type 2 and ISO 27001.
About Sustainable Metal Cloud
Our vision is to move cloud computing towards net zero, with solutions forged through advanced technology. We partner with NVIDIA to provide large-scale GPU AI infrastructure.
WHY YOU'LL LOVE WORKING HERE
Our team shares a passion for possibility, knowing that our technology enables ideas across the world: ideas that can reshape the course of progress and break down traditional boundaries.