Senior Network Engineer, AI Infrastructure
ROLES AND RESPONSIBILITIES
Firmus Technologies is seeking a skilled Senior Network Engineer specialising in AI networks to join our Cloud Architecture and Software Defined Infrastructure team.
The ideal candidate will play a crucial role in network design, configuration, and deployment for AI infrastructure projects. This role offers an exciting opportunity to work at the forefront of AI networking technology and contribute to the growth of AI infrastructure.
- Primary responsibilities will include designing and building bespoke AI infrastructure for new and existing customers.
- Support operational and reliability aspects of large-scale AI clusters with a focus on performance at scale, real-time monitoring, logging, and alerting.
- Provide specialist network engineering support to ensure optimal operation of network software and hardware.
- Develop high-quality automation and scripts to operate network infrastructure at scale.
- Engage in and improve the whole lifecycle of services – from inception and design through deployment, operation, and refinement.
- Improve internal tooling by identifying automation opportunities to drive speed and scale in our capabilities.
- Be the subject matter expert for networking-related escalations.
- Provide feedback to internal teams, for example by opening bugs, documenting workarounds, and suggesting improvements.
SKILLS AND EXPERIENCE
- B.Sc. in Computer Science, Electrical Engineering, or Mechanical Engineering, or equivalent experience.
- Hands-on experience in solving problems in large-scale RDMA over Converged Ethernet (RoCE) or InfiniBand network environments.
- Strong hands-on experience in Linux-based platforms.
- In-depth knowledge of network protocols and tools and management of security measures for network infrastructure.
- Familiarity with data path hardware acceleration protocols and interfaces such as RDMA, RoCE, and InfiniBand.
- Familiarity with Infrastructure as Code practices. Experience in developing IaC to support automation.
- Experience in using network automation tools such as Terraform, Ansible, Puppet, and Python scripts.
- Familiarity with Linux networking, device APIs, and firewall policy management.
- Experience with switching and routing network protocols.
- Fast and independent self-learner with outstanding technical skills.
- Driven and focused on customer needs and satisfaction.
- Self-motivated with excellent leadership skills.
- Strong written, verbal, and listening skills are essential.
KEY COMPETENCIES
- CCIE or equivalent networking certifications and certification in Linux systems.
- 5+ years of experience with AI, HPC, or parallel network architectures.
- Proficiency in Infrastructure as Code (IaC) tools (e.g. Ansible, NetBox, Python scripts).
- Understanding of how MPI, RDMA, and NCCL work, as well as how job schedulers (Slurm, PBS) work.
- Proven knowledge of Python or Bash.
- Professional Services/Infrastructure Specialists delivery experience.
LOCATION
Singapore
EMPLOYMENT BASIS
Full-Time
About Sustainable Metal Cloud
Our vision is to move cloud computing towards net zero, with solutions forged through advanced technology. We partner with NVIDIA to provide large-scale GPU AI infrastructure.
WHY YOU'LL LOVE WORKING HERE
Our team shares a passion for possibility, knowing that our technology enables ideas across the world. Ideas that can reshape the course of progress and break down traditional boundaries.