Site Reliability Engineer
Firmus Technologies
Firmus Technologies is a global leader pioneering the solution to AI’s energy challenge, founded in Australia in 2019 by a visionary team of entrepreneurs. Our mission is to create the most energy-efficient AI infrastructure, combining cutting-edge technology with a steadfast commitment to sustainability.
Through ground-breaking research and development, we invented a verticalized AI Factory - a new class of digital infrastructure that replaces traditional data centres. Built on new approaches to liquid cooling, energy management, water use and modular construction methodology, the Firmus AI Factory delivers low-cost AI tokens across Asia-Pacific.
Firmus AI Cloud
We provide customers with access to energy savings via our large-scale GPU cloud, Firmus AI Cloud. Rated Silver in The GPU Cloud ClusterMAX™ Rating System, our cloud empowers developers, enterprise, education and government users to train AI models with unmatched efficiency and cost savings. With an ever-growing list of services and applications, we are committed to building a cloud experience for our customers that is market-leading, proprietary and built to scale.
Why you’ll love working here
- A fast-paced and dynamic environment working with next-gen technology. You’ll be operating at the intersection of sustainability and artificial intelligence – helping to transform an industry.
- The chance to work alongside colleagues who are true innovators and leaders in their fields.
- As an emerging company, we work as a close-knit team. Work with the founders, grow a strong network, and witness the impact you make first-hand as we democratise AI tools for everyone – more sustainably, and more affordably.
- We believe that people from diverse backgrounds come together to do their best work, be their authentic selves, and build great things. We are proud to be an equal opportunity employer.
ROLE SUMMARY
Firmus Technologies is seeking a skilled Site Reliability Engineer to join our Operations team, supporting the daily operations and maintenance of our AI-accelerated High-Performance Computing (HPC) infrastructure. This role will work closely with Field Service Engineers, HPC and Network Engineering teams, and assist the Global Operations Centre (GOC). This is a unique opportunity to contribute directly to the stability and growth of cutting-edge AI infrastructure.
KEY RESPONSIBILITIES
- Support the deployment, configuration, and maintenance of high-end GPU servers, storage servers, networking equipment and software components in highly secure environments.
- Perform hardware diagnostics, system functionality checks and firmware updates as required.
- Collaborate with engineering teams to help deploy tailored customer environments (e.g. bare-metal systems, HPC clusters, Kubernetes, Slurm).
- Serve as the first line of engineering support for onsite operational issues, including troubleshooting hardware, network and software problems, and firmware compliance.
- Troubleshoot incidents, escalate critical issues and provide feedback to appropriate teams for improvements.
- Participate in an on-call rotation to ensure 24/7 availability and responsiveness to critical issues.
- Provide technical support to the GOC Support Specialist team in troubleshooting compute infrastructure problems.
- Document incident details, resolutions, and lessons learned to enhance future problem-solving.
- Maintain clear, accurate, and up-to-date documentation to promote effective knowledge sharing across the team.
- Communicate effectively with GOC, HPC Engineers, internal teams, stakeholders, and end-users to ensure alignment on issue resolution.
- Take part in team meetings and knowledge-sharing sessions to foster collaboration and continuous learning.
SKILLS AND EXPERIENCE
- Bachelor’s degree in computer engineering, computer science, or a related technical field.
- 5+ years of experience in technical field service roles.
- Strong understanding of server hardware technology, firmware lifecycle, Linux environments and troubleshooting hardware problems, with adherence to physical and system-level security standards.
- Experience with scripting languages (e.g. Bash, Python).
- Familiarity with configuration management, CI/CD tools, workload managers and cluster software (e.g. Slurm, Kubernetes, NVIDIA BCM), and observability tools (e.g. Prometheus, Grafana, ELK).
- Excellent problem-solving and analytical skills.
- Ability to work independently and as part of a team.
- Strong communication skills, both written and verbal.
Location & Reporting
- Based in: Singapore
- Reporting to: Senior Operations Manager
Employment Basis
Full-time
Diversity
At Firmus, we are committed to building a diverse and inclusive workplace. We encourage applications from candidates of all backgrounds who are passionate about creating a more sustainable future through innovative engineering solutions.
Join us in our mission to revolutionise the AI industry through sustainable practices and cutting-edge engineering. Apply now to help shape the future of sustainable AI infrastructure.
About Sustainable Metal Cloud
Our vision is to move cloud computing towards net zero, with solutions forged through advanced technology. We partner with NVIDIA to provide large-scale GPU AI infrastructure.
WHY YOU'LL LOVE WORKING HERE
Our team shares a passion for possibility, knowing that our technology enables ideas across the world. Ideas that can reshape the course of progress and break down traditional boundaries.