
Data Centre Engineer, Field Operations

ROLES AND RESPONSIBILITIES

Firmus Technologies is seeking a skilled Data Centre Engineer to join our Operations team, supporting the daily operations and maintenance of our AI-accelerated high-performance computing (HPC) infrastructure. This role will work closely with Field Service Engineers, HPC and Network Engineering teams, and assist the Global Operations Centre (GOC). This is a unique opportunity to contribute directly to the stability and growth of cutting-edge AI infrastructure.


KEY RESPONSIBILITIES

  • Support the deployment, configuration, and maintenance of high-end GPU servers, storage servers, networking equipment, and software components in highly secure environments.
  • Perform hardware diagnostics, system functionality checks, and firmware updates as required.
  • Collaborate with engineering teams to deploy tailored customer environments (e.g. bare-metal systems, HPC clusters, Kubernetes, Slurm).
  • Serve as the first line of engineering support for onsite operational issues, including troubleshooting hardware, network, and software problems.
  • Troubleshoot incidents, escalate critical issues and provide feedback to appropriate teams for improvements.
  • Participate in an on-call rotation to ensure 24/7 availability and responsiveness to critical issues.
  • Provide technical support to the GOC Support Specialist team in troubleshooting HPC-related problems.
  • Document incident details, resolutions, and lessons learned to enhance future problem-solving.
  • Maintain clear, accurate, and up-to-date documentation to promote effective knowledge sharing across the team.
  • Communicate effectively with GOC, HPC Engineers, internal teams, stakeholders, and end-users to ensure alignment on issue resolution.
  • Take part in team meetings and knowledge-sharing sessions to foster collaboration and continuous learning.


SKILLS AND EXPERIENCE

  • Bachelor’s degree in computer engineering, computer science, or a related technical field.
  • 5+ years of experience in technical field service roles.
  • Strong understanding of server hardware technology, Linux environments and troubleshooting hardware problems, with adherence to physical and system-level security standards.
  • Experience with scripting languages (e.g. Bash, Python).
  • Familiarity with workload managers and cluster software (e.g. Slurm, Kubernetes, NVIDIA BCM) and observability tools (e.g. Prometheus, Grafana, ELK).
  • Excellent problem-solving and analytical skills.
  • Ability to work independently and as part of a team.
  • Strong communication skills, both written and verbal.

 

LOCATION
Singapore

EMPLOYMENT BASIS
Full Time

At Firmus, we are committed to building a diverse and inclusive workplace. We encourage applications from candidates of all backgrounds who are passionate about creating a more sustainable future through innovative engineering solutions.

Join us in our mission to revolutionise the AI industry through sustainable practices and cutting-edge engineering. Apply now to help shape the future of sustainable AI infrastructure.

About Sustainable Metal Cloud

Our vision is to move cloud computing towards net zero, with solutions forged through advanced technology. We partner with NVIDIA to provide large-scale GPU AI infrastructure.

WHY YOU'LL LOVE WORKING HERE

Our team shares a passion for possibility, knowing that our technology enables ideas across the world: ideas that can reshape the course of progress and break down traditional boundaries.