
30 September 2024

Interviews

Solutions Architect: Derek Ngo on scaling efficiency goals

At Sustainable Metal Cloud, we continuously push the boundaries of technology and set new goals for ourselves. One of our standout achievements this year was the release of a world-first power benchmark, certified by MLPerf, which highlights our leadership in power efficiency and sustainability. Using the NVIDIA SuperPOD reference architecture, our platform not only matched or exceeded the performance of competing clusters but also achieved up to 45% energy savings. A key contributor to this milestone was Derek Ngo, our Solutions Architect, who played a crucial role in delivering these exceptional MLPerf results.

Considering your past work experience, what would you say has prepared you for the challenges of optimising AI infrastructure at scale?

When it comes to AI, GPUs are often the first thing that comes to mind. While GPUs are indeed crucial for powering AI, the supporting infrastructure, such as networking, storage and security, is equally important for an optimal setup. It is critical to develop a solution in which the various components work cohesively to achieve maximum performance and efficiency. Thankfully, my experience over the years has placed me in a good position to address many of the challenges of large-scale AI deployments. The ability to balance high-performance GPU setups with the necessary networking and storage optimisations has allowed me to deliver solutions that meet both performance and sustainability goals.

Drawing from your experience in systems engineering, what innovations have you been able to implement or have been implemented at SMC to optimise GPU performance and sustainability, especially for large-scale deployments?

At SMC, we are fortunate to have a team of specialists who bring deep expertise across multiple areas: GPUs, networking, storage, and security. As the size of each deployment increases, so does its complexity. Having experts with deep knowledge of each of these areas pays off in the long run: it lets us develop optimal solutions for immediate requirements while keeping the flexibility to scale easily in future.

Critical performance tuning and verification are carried out at each layer to ensure the optimal performance of a cluster. I’m pleased to be a part of it and contribute to the team’s overall success!

Immersion cooling technology is a key aspect of SMC’s infrastructure. How has this technology impacted the performance and energy efficiency of SMC’s AI clusters (particularly during your work on MLPerf benchmarks)?

By eliminating the need for high-powered fans and lowering cooling requirements, we significantly reduce energy consumption compared with traditional air-cooled environments. Our immersion cooling keeps the GPU nodes in a consistent thermal state, maximising availability while maintaining consistent performance throughout the benchmarks.

“We have successfully demonstrated that SMC’s AI clusters match the performance of air-cooled clusters while using over 40% less direct energy consumption.”

As someone with deep experience in delivery and systems engineering, what are the biggest deployment challenges and/or wins you’ve faced at SMC?

As a small island state, Singapore faces significant constraints on space and energy resources. SMC's immersion cooling technology offers an ideal solution to overcome these challenges. By optimising data centre space, we can deploy large-scale AI infrastructure while significantly reducing cooling requirements and carbon footprint. This approach not only enhances efficiency but also positions us competitively against other regions by doing more with less.

What do you see as an upcoming trend in AI infrastructure? How is SMC positioned to stay ahead of the curve?

With each new generation, GPUs become increasingly power-hungry in the quest for performance. Powering and cooling a 100kW+ rack, which was unheard of just a few years ago, is now a reality. This higher power consumption generates more heat and demands more energy for cooling, driving up the overall Total Cost of Ownership (TCO) and putting immense strain on traditional data centre setups.

SMC is well-positioned to tackle these challenges, thanks to our extensive R&D in immersion cooling. Our immersion technology eliminates the need for fans, resulting in no moving parts and, consequently, greater long-term reliability. The result is lower energy consumption and reduced carbon emissions, supporting sustainable AI growth.