
12 April 2024

Interviews

Senior Solution Architect: Sean Zhan on Computability

Looking back, the cost of computation has fallen dramatically. With nearly 20 years of experience in the industry, Sean Zhan talks about the beginning of the HPC revolution and how it affects computation today.

1. Hi Sean, let’s talk through what sparked your passion for MLOps & HPC, and how it led to your current role at SMC.

I first got into HPC during university — it’s been interesting to see how HPC and cluster computing technology have grown over the last 20 years. The last 10 to 15 years have been especially exciting, with the rise of AI. People discovered that by leveraging the mature distributed computing technology of HPC, AI workloads are no longer limited by the computing power of a single computer.

However, when we scale the cluster, not only does the complexity of the system increase, but so does its power consumption. The rapidly growing size of neural network models has pushed sustainability challenges to unprecedented heights.

It wasn't until SMC caught my attention that I found real solutions to these challenges.

2. Can you share your perspective on the evolving role of Artificial Neural Networks in advancing AI applications, particularly in relation to sustainability?

If we look back, the cost of computation has fallen dramatically. I vividly remember my time at IBM, working on the world's first PetaFlops supercomputing system, RoadRunner. We were amazed by the powerful computational capability of that system. Yet fast forward 17 years to GTC 2024, and Nvidia has unveiled the GB200 NVL72, delivering 1.4 ExaFlops in just a single rack.
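
To put those two numbers side by side (a rough, order-of-magnitude comparison only, since the RoadRunner and GB200 NVL72 figures are quoted at different numerical precisions):

    \[
    \frac{1.4\ \text{EFLOPS}}{1\ \text{PFLOPS}} = \frac{1.4\times 10^{18}}{1\times 10^{15}} \approx 1{,}400\times,
    \qquad
    \log_2 1400 \approx 10.4\ \text{doublings in 17 years} \;\Rightarrow\; \text{roughly one doubling every 20 months.}
    \]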

On the other hand, we see computing requirements growing even faster than chip technology evolves, because both the size of individual AI models and the total number of AI applications are growing exponentially:

  • Very soon, a 100B-parameter large language model will be considered a 'small model'
  • We will also see LLMs running on your cell phone or other mobile devices

Edge computing and computing centres reinforce each other, resulting in a positive cycle of computing power expansion. All these expansions eventually challenge the sustainability of the IT industry and our planet.

3. Navigating through the complexities of AI and HPC, can you share a project at SMC that you're particularly proud of?

That has to be my first project at SMC: building SMC’s GPU cloud.

Over the past 20 years, I have been exposed to many different HPC technologies and tools, and this was the first project where I got to use almost everything I have learned.

More importantly, I have had the privilege of working with many of SMC’s geniuses. Their excellent technology, rigorous attitude, and almost fanatical enthusiasm for the work allowed us to create a miracle: in just two months, we went from constructing the immersion-cooled hardware and deploying our GPU cloud software straight through to final testing and delivery.

I'm so pleased to see that we are not just building a platform, but building an entire evolution cycle from customer to product to technology.

4. In your experience, how does SMC prioritise the balance between reliability and sustainability while evaluating the computability of its systems?

It’s not about balancing — we consider reliability, computability, security, and manageability as core parts of our system's ‘consumability’.

We don’t choose to compromise; we maximise. With traditional methodologies, people often have to select one thing over another: for managing a cloud platform, sustainability, consumability, and cost form an impossible triangle. Cost is almost always a constraint, so people end up choosing between consumability and sustainability.

When SMC started to develop its unique immersion cooling technology, both sustainability and consumability were baselines of the system. Compromising on either would have meant the technology could not handle the future of AI/GPU expansion. That’s why SMC’s technology is a world-first.

5. HPC is critical for AI research and development. How does SMC integrate sustainability into its HPC operations?

HPC technology was proposed 40 to 50 years ago, and there are many definitions of it. I still like its traditional definition, ‘Leveraging the network to aggregate computing resources on different computers, to complete a computing task which is beyond a single server’s capability’.

Humans will always have some problems that exceed the capabilities of a single computer. Therefore, over the years, HPC technology has been continuously applied in many industries. Meanwhile, the technology itself has also continued to evolve to solve new computing problems:

In the development of AI applications, especially the training of large language models, the computing task has far exceeded the computing power of a single server. To solve this problem, many traditional HPC technologies are used, including network interconnection methods, task distribution, and the distribution and aggregation of information during the calculation process. Of course, people have also made many improvements to make these technologies applicable to new GPU architectures.
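
As a rough sketch of that distribution-and-aggregation pattern (not SMC's actual stack; the model, data, and backend here are placeholders), a data-parallel training step in PyTorch lets every worker compute gradients on its own shard and then averages them with an all-reduce:

    # Minimal data-parallel sketch: each rank computes gradients on its own
    # shard of the batch, then gradients are summed across all ranks with an
    # all-reduce and averaged (the classic HPC distribute-and-aggregate step).
    import torch
    import torch.distributed as dist
    import torch.nn as nn

    def train_step(model, batch, targets, loss_fn, world_size):
        loss = loss_fn(model(batch), targets)
        loss.backward()
        for p in model.parameters():
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # aggregate over all workers
            p.grad /= world_size                           # average the summed gradients
        return loss

    def main():
        # Intended to be launched with `torchrun --nproc_per_node=N script.py`,
        # which sets RANK / WORLD_SIZE / MASTER_ADDR for init_process_group.
        dist.init_process_group(backend="gloo")  # "nccl" on GPU clusters
        world_size = dist.get_world_size()

        model = nn.Linear(16, 1)           # toy model, replicated on every rank
        loss_fn = nn.MSELoss()
        batch = torch.randn(8, 16)         # in practice, each rank loads its own data shard
        targets = torch.randn(8, 1)

        loss = train_step(model, batch, targets, loss_fn, world_size)
        if dist.get_rank() == 0:
            print(f"step loss: {loss.item():.4f}")
        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Real LLM training layers tensor, pipeline, and data parallelism on top of this, with GPU-aware communication libraries and high-speed interconnects doing the aggregation, but the underlying pattern is the same one HPC has used for decades.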

SMC follows Nvidia's SuperPOD architecture, but in a more sustainable way: presently, all GPU servers are immersion-cooled. In the future, we plan to apply immersion cooling to more components, including switches, storage, and CPU servers.

On top of the infrastructure, by leveraging GPUDirect technology, network virtualization, multi-tenancy, infrastructure as code, DevOps, and more, we operate our system in a secure, efficient, and reliable way.

6. What emerging technologies or methodologies do you believe will significantly influence the sustainability of AI and HPC?

I would divide the technology for AI and HPC into three categories: software, platform, and infrastructure.

From the software perspective, I feel we are at the edge of another wave of AI innovation: from instinct to self-reflection. Most of our current AI models work like human instinct, i.e. instant reaction without long-term reasoning. The next generation of AI technology will be able to review and re-abstract knowledge by itself, which can dramatically increase the intelligence level of AI. However, it will also increase the number of tokens consumed, and therefore the dollar cost, per task, which brings more challenges to the sustainability of infrastructure.
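
A toy sketch of what such a self-reflection loop could look like (purely hypothetical, with call_model() as a stub standing in for any LLM API) makes the cost point concrete: every critique-and-revise round multiplies the model calls, and hence the tokens, behind a single query:

    # Hypothetical self-reflection loop: draft an answer, critique it, revise it.
    # call_model() is a placeholder for any LLM API; the point is only that each
    # reflection round adds model calls (and tokens) on top of the single
    # "instinctive" answer.

    def call_model(prompt: str) -> str:
        """Placeholder for an LLM call; returns a canned response."""
        return f"response to: {prompt[:40]}..."

    def answer_with_reflection(question: str, rounds: int = 2):
        calls = 1
        draft = call_model(question)                 # instinctive first pass
        for _ in range(rounds):
            critique = call_model(f"Critique this answer: {draft}")
            draft = call_model(f"Revise the answer using this critique: {critique}")
            calls += 2                               # each round adds two more passes
        return draft, calls

    answer, calls = answer_with_reflection("Why is immersion cooling more efficient?")
    print(calls)  # 5 calls instead of 1: several times the tokens, energy, and cost per query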

Cloud platforms must adapt to the growing demand for inference, shifting from the previous emphasis on training, driven by the surge of new AI applications on both the internet and edge devices.

On the infrastructure front, the cost of managing such infrastructure, including hardware, power, and cooling, continues to rise. The trend is moving away from traditional small data centres towards centralised, dedicated AI/GPU cloud services, driven by resource efficiency as well as the new sustainability measures being implemented.