
AI Engineer, AI & Applications

Role Summary

The AI Engineer will establish and build the MLOps and AIOps foundations for Firmus AI Factory, our AI platform, making it trustworthy, repeatable, and scalable. This is a pioneering role: you will set up the end-to-end MLOps workflows that turn model development into a disciplined release process with clear governance, automated evaluation gates, and reliable promotion to production. You will also enable our Model Arena initiative by operationalizing the evaluation pipelines and standards, so that model choices for RAG and agentic applications are data-driven, reproducible, and production-safe.

You are the reliability owner for all Firmus AI Factory AI features: training jobs, inference services, and RAG systems. You'll define quality gates, model promotion workflows, production monitoring, and incident response procedures. Your job is to make AI features as trustworthy as core infrastructure: fast, reliable, and observable. You'll work across the entire team, partnering with engineers on CI/CD gates, with data scientists on quality metrics, and with ops on L2/L3 incident response.


Key Responsibilities

•    Design and own end-to-end MLOps workflows: training → evaluation → registry → deployment → monitoring → retraining/retirement in dev/staging/production environments, with clear standards and ownership boundaries. 
•    Own the model registry and promotion lifecycle (MLflow): stage/alias strategy, approvals, environment separation, access control, and rollback readiness (see the registry sketch after this list). 
•    Establish reproducibility and lineage across the model lifecycle: versioned code/config, artifact traceability, dataset/version references, and repeatable evaluation runs. 
•    Design and implement automated model quality gates for production, covering accuracy, latency, cost, and safety. 
•    Define SLOs/SLIs for all AI features: training job success rate, p99 inference latency, RAG retrieval accuracy, availability, and cost metrics. 
•    Build production monitoring dashboards: track model performance, data drift, operational health; integrate with alerting (PagerDuty, Slack, etc.). 
•    Create on-call runbooks and triage procedures for AI service incidents; lead postmortem-driven improvements. 
•    Instrument AI services for debugging: request traces, per-model GPU metrics, retrieval performance, and communication bottlenecks. 
•    Integrate evaluation frameworks (benchmarking, RAGAS, LLM-as-judge) into CI/CD pipelines (a minimal gate sketch follows this list).
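
To make the quality-gate and evaluation-pipeline responsibilities above concrete, here is a minimal sketch of how evaluation scores (from benchmarks, RAGAS, or an LLM-as-judge run) might be checked in CI/CD before a model is promoted. The metric names, thresholds, and results-file format are illustrative assumptions, not a description of an existing Firmus pipeline.

```python
"""Minimal CI evaluation gate: block promotion when candidate metrics regress.

The metrics and thresholds below are hypothetical; in practice the scores would
be produced by a benchmark suite, RAGAS, or an LLM-as-judge evaluation job.
"""
import json
import sys

# Promotion thresholds (illustrative values, agreed with data science).
GATES = {
    "answer_correctness": 0.80,        # higher is better
    "faithfulness": 0.85,              # higher is better
    "p99_latency_ms": 1200,            # lower is better
    "cost_per_1k_requests_usd": 0.50,  # lower is better
}
LOWER_IS_BETTER = {"p99_latency_ms", "cost_per_1k_requests_usd"}


def evaluate_gates(results: dict) -> list[str]:
    """Return human-readable failure messages for every gate that is not met."""
    failures = []
    for metric, threshold in GATES.items():
        value = results.get(metric)
        if value is None:
            failures.append(f"{metric}: missing from evaluation results")
        elif metric in LOWER_IS_BETTER and value > threshold:
            failures.append(f"{metric}: {value} exceeds limit {threshold}")
        elif metric not in LOWER_IS_BETTER and value < threshold:
            failures.append(f"{metric}: {value} below minimum {threshold}")
    return failures


if __name__ == "__main__":
    # e.g. `python eval_gate.py eval_results.json` as a CI step after the eval job.
    with open(sys.argv[1]) as f:
        failures = evaluate_gates(json.load(f))
    for failure in failures:
        print(f"GATE FAILED - {failure}")
    sys.exit(1 if failures else 0)  # a nonzero exit blocks promotion in the pipeline
```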

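The MLflow registry bullet above describes an alias-based promotion flow; the sketch below shows one way that could look, assuming MLflow 2.x. The model name, run ID, and alias names are illustrative, and approvals and environment separation would live in the surrounding CI/CD system rather than in the script itself.

```python
import mlflow
from mlflow import MlflowClient

client = MlflowClient()        # tracking/registry URI comes from the environment
MODEL_NAME = "rag-reranker"    # hypothetical registered-model name
run_id = "abc123def456"        # illustrative training-run ID

# Register the artifact logged by the training run as a new model version.
version = mlflow.register_model(model_uri=f"runs:/{run_id}/model", name=MODEL_NAME)

# Point the "staging" alias at the new version so staging deployments pick it up.
client.set_registered_model_alias(MODEL_NAME, "staging", version.version)

# Once automated evaluation gates pass, promotion is just moving the alias.
client.set_registered_model_alias(MODEL_NAME, "production", version.version)

# Rollback readiness: re-point the alias at the last known-good version.
# client.set_registered_model_alias(MODEL_NAME, "production", previous_good_version)

# Consumers resolve the alias at load time, so deployments never hard-code a version.
model = mlflow.pyfunc.load_model(f"models:/{MODEL_NAME}@production")
```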

Skills & Experience

  • 5–8 years in MLOps / ML platform / production engineering roles with hands-on ownership of production ML delivery pipelines.
  • Deep understanding of ML lifecycle: model versioning, promotion strategies, evaluation automation, governance, deployment strategies, monitoring, drift detection.
  • Hands-on experience with MLflow Model Registry workflows (stages/aliases, approvals, traceability) and integrating registry actions into release pipelines.
  • Experience operationalizing model evaluation systems (metrics standards, orchestration, logging, reproducibility).
  • Strong observability and production fundamentals: metrics/logs/traces, alert design, incident response, and reliability mindset.
  • Familiarity with CI/CD pipelines, model packaging, and deployment automation; comfortable collaborating with ML engineers, platform/SRE, and application teams to turn requirements into robust workflows.
  • Understanding of distributed systems, resource management, and failure modes in training and inference environments.


Key Competencies

  • Production Ownership: comfortable owning services in production; proactive about monitoring, alerting, and preventing issues.
  • Reliability Engineering: can define SLOs and error budgets and foster a blameless postmortem culture.
  • Cross-Functional Leadership: works with ML engineers, data scientists, and platform teams; unblocks reliably.
  • Incident Response: triage skills, root cause analysis, systemic thinking (not just fighting fires).
  • Automation: uses programmatic automation to reduce toil and make the right path the easy path, balancing rigor with speed.
  • Communication: explains complex ML/systems issues clearly to both technical and non-technical stakeholders.


Success Metrics

  • Reproducible, auditable model release workflow becomes the default across teams (clear lineage and consistent promotion standards).
  • Automated evaluation gates prevent the majority of quality/performance regressions from reaching production.
  • Model registry and deployment practices support safe rollouts and fast rollbacks with minimal disruption.
  • Reliable AI services (SLO-driven): training/inference/RAG services consistently meet reliability targets and error budgets.
  • Faster detection and recovery: incident MTTD/MTTR improve over time and repeat occurrences of known incident classes decline.
  • Higher signal-to-noise alerting: fewer redundant alerts per true incident through correlation/deduplication improvements.
  • Operational automation maturity increases: more incident classes handled with consistent triage and safe automation.


Location & Reporting

  • Singapore or Australia (Launceston, TAS or Sydney, NSW)
  • Reporting to Head of AI & Applications


Employment Basis

Full-time


Diversity

At Firmus, we are committed to building a diverse and inclusive workplace. We encourage applications from candidates of all backgrounds who are passionate about creating a more sustainable future through innovative engineering solutions. 

Join us in our mission to revolutionize the AI industry through sustainable practices and cutting-edge engineering. Apply now to be part of shaping the future of sustainable AI infrastructure. 

About Sustainable Metal Cloud

Our vision is to move cloud computing towards net zero, with solutions forged through advanced technology. We partner with NVIDIA to provide large-scale GPU AI infrastructure.

Why You'll Love Working Here

Our team shares a passion for possibility, knowing that our technology enables ideas across the world. Ideas that can reshape the course of progress and break down traditional boundaries.