MLPerf® Training V4.0:
Stable Diffusion v2

Result verified by MLCommons Association.

Testing notes

Stable Diffusion v2 is a text-to-image model that turns text descriptions into detailed images. It uses a technique called latent diffusion, in which a random noise pattern is gradually refined into a picture that matches the given text. This version is trained on a wider range of data than its predecessor, resulting in more creative and higher-quality images. It also comes with an upscaler model that can increase the resolution of generated images.

Due to submission deadlines, we were unable to submit power results for Stable Diffusion v2 on time, so the energy figures on this page are shown as n/a.

Results
NODE LEVEL

Our verified MLPerf® Training V4.0 submission.

8 nodes / 64 GPUs
Total Energy Consumed (Joules) - AC n/a
Total Energy Consumed (Joules) - DC n/a
Total Energy Consumed (kWh) - DC n/a
Total Time (decimal) 7.26450
Total Time (MM:SS) 07:16
   
Ave Energy Per Node (Network¹) - DC n/a
Ave Energy Per Node (Node²) - DC n/a
Ave Energy Per Node (Combined) - DC n/a
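
The two Total Time rows above give the same measurement in different notations. A minimal sketch of the conversion, assuming the decimal figure is in minutes and that the MM:SS value is rounded to the nearest whole second (the function name is illustrative, not part of the submission):

```python
def decimal_minutes_to_mmss(minutes: float) -> str:
    """Convert decimal minutes to an MM:SS string, rounding to whole seconds."""
    total_seconds = round(minutes * 60)
    mm, ss = divmod(total_seconds, 60)
    return f"{mm:02d}:{ss:02d}"


print(decimal_minutes_to_mmss(7.26450))  # 07:16
```
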
DATA CENTRE LEVEL³

Power consumption at the data centre level. This is not in scope within MLPerf® Training V4.0, and has not been peer reviewed by MLCommons members.

8 nodes / 64 GPUs
Singapore 2 PUE 1.10
Extrapolated Total Energy Consumed n/a
Net CO2 emitted n/a

Footnotes: see Disclaimer & Footnotes section below.

Average Run time

The as-submitted results for MLPerf® Training V4.0 for H100 SXM systems.

Primary Information

Parameters used in the benchmark run (across all nodes)

Test Name Stable Diffusion v2
Type Image Generation
Framework PyTorch NVIDIA Release 24.04
Dataset LAION-400M-filtered
Submission Date 10 May 2024
Publishing Forum MLPerf® Training V4.0
Peer reviewed? YES - MLCOMMONS
Hardware Information

Compute node hardware specifications.

Instance Type H100 80GB SXM
CPU Xeon 8462Y+, 128 vCPU
Memory 2,048 GB DDR4
Network Cards (RDMA) 8 x ConnectX-7
RDMA YES
NVLink 900 GB/s
Environmental & DC Information

Test location & environmental conditions present at test.

Region Singapore
Availability Zone Singapore 2 (SIN02)
HyperCube Immersion Yes
Energy Grid Carbon Intensity 0.405 kg CO2-e/kWh (2022)
HyperCube design pPUE 1.02
Facility including HyperCube design PUE 1.10
Storage Cluster

External storage cluster used in benchmark run.

Type WEKA
Disks NVME
Network

Compute and storage network details.

Compute Fabric InfiniBand NDR 200
Contention Min. 1,600 Gbps uncontended
Storage Fabric Ethernet
Contention Peak 200 Gbps
Networking Power Allocation

Overhead allocation of power for networking to test results¹

1 Node n/a
8 Nodes 16.10 kW
9 Nodes 16.10 kW
64 Nodes 59.57 kW
Disclaimer & Footnotes

Disclaimer & Validity - MLCommons

MLPerf® Training v4.0 Closed Stable Diffusion v2 offline. Retrieved from https://mlcommons.org/benchmarks/training/ 12 June 2024. Result verified by MLCommons Association. The MLPerf® name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information.


Footnotes

  1. Network power allocation: To accurately capture the total power envelope of multi-node tests, it is appropriate to measure the power consumption of networking equipment that is associated with the test. For MLPerf® Training V4.0, a method of proportional allocation of total network power was adopted by the members. For our submission, the following methodology was used and accepted: SMC has a 'Scalable Unit' ('SU') of 6 nodes in 3 bays with a pair of nodes in each bay. Each SU consists of 2 QM9790 64-port Leaf switches, Leaf A and Leaf B. Four out of eight CX-7 NICs in each of the six nodes connect to Leaf A, and the other four CX-7s connect to Leaf B. 16 ports from each Leaf switch connect to 16 Spine switches Spine 01 – Spine 16. Power consumption per switch: 1610W. Port Utilization for leaf switches: (16+4*Num_used_nodes)/64 ports (16 upstream, 24 downstream). Port utilization for spine switches: 2*Number of SUs (each SU has 2 leaf switches each of which connects to 1 port of a spine switch).

    Power consumption by cluster size: 8 nodes - 2 SUs; 6N in SU1, 2N in SU2 - Total interconnect power 16,100 W. 9 nodes - 2 SUs; 6N in SU1, 3N in SU2 - Total interconnect power 16,100 W. 64 nodes - 11 SUs; 6N in SU1-10, 4N in SU11 - Total interconnect power 59,570 W.

    For each cluster size, power has been apportioned pro rata by the time taken to complete the benchmark test. For example, a hypothetical 8 node test that took 15 minutes to complete has its power apportioned as follows: (15/60) * 16,100 = 4,025 W. Measurements: 1 Watt equals 1 Joule per second. 1 kWh equals 3,600,000 Joules.

  2. Node power: SMC records and submits DC power via monitoring of power rectifiers that are upstream of the node. SMC nodes are immersed in HyperCube immersion tanks, arranged in 3 bays, each containing 2 H100 SXM nodes (16 GPUs). Each node is powered by a common 54V bus per bay, with each bus supporting 2 nodes. Each bus is connected to two HyperCube powershelves, equipped with 4 x 5.0 kW power rectifiers. Each rectifier reports statistics, including input/output voltage, current, power and temperature over a CAN bus. This information is available via an HTTP REST endpoint, which is polled and logged at 1 second intervals.

    For the MLPerf® Training V4.0 results, all runs for all tests were polled and written to an SQLite database. For MLCommons members, the raw data collected was submitted along with our results and verified as accurate and true. The total Joules submitted to MLCommons and displayed on this page include a verified proportion for overhead allowance of networking power, as calculated above.

    In instances where Watt, Kilowatt or Kilowatt-hour are displayed, a relevant conversion from Joules and time to complete has been performed.
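
The logging and conversion steps described in footnote 2 can be sketched as follows. This is not the collection code used for the submission: the table name, column names, rectifier IDs and sample values are all hypothetical, the REST polling step is replaced by directly inserted samples, and each 1 second sample is assumed to represent the power drawn for that whole second.

```python
import sqlite3


def create_log(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: one row per rectifier per poll.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS rectifier_samples ("
        "ts REAL NOT NULL, rectifier_id TEXT NOT NULL, output_power_w REAL NOT NULL)"
    )


def log_sample(conn: sqlite3.Connection, ts: float, rectifier_id: str, power_w: float) -> None:
    conn.execute(
        "INSERT INTO rectifier_samples VALUES (?, ?, ?)", (ts, rectifier_id, power_w)
    )


def energy_joules(conn: sqlite3.Connection, start_ts: float, end_ts: float,
                  interval_s: float = 1.0) -> float:
    """Sum power samples in [start_ts, end_ts) and multiply by the poll interval."""
    (total_w,) = conn.execute(
        "SELECT COALESCE(SUM(output_power_w), 0.0) FROM rectifier_samples "
        "WHERE ts >= ? AND ts < ?",
        (start_ts, end_ts),
    ).fetchone()
    return total_w * interval_s


def average_power_w(joules: float, duration_s: float) -> float:
    """Convert energy and time to complete back into average power."""
    return joules / duration_s


conn = sqlite3.connect(":memory:")
create_log(conn)
# Illustrative data: two rectifiers, each reporting 2,500 W for 10 seconds.
for t in range(10):
    for rectifier in ("rect-a", "rect-b"):
        log_sample(conn, float(t), rectifier, 2500.0)

joules = energy_joules(conn, 0.0, 10.0)
print(joules)                         # 50000.0
print(average_power_w(joules, 10.0))  # 5000.0
```
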

  3. Data Centre PUE & CO2 calculations: Data centre PUE calculations are not in scope of the MLPerf® Training V4.0 results, and have not been peer reviewed. Data relating to pPUE and PUE is presented based on our own measurements, which include calculations using industry-standard methodology.

    HyperCube pPUE: Partial PUE considers the load within the HyperCube and does not consider any of the supporting infrastructure of the wider data centre. The observed pPUE of 1.02 is representative of the system's core efficiency gains through the elimination of the fans and chilled water infrastructure that would otherwise support the main HPC heat load. The Partial PUE includes cooling water pumps, fluid pumps and cooling tower fan energy.

    Extrapolated PUE: This grossed-up calculation embodies total facility losses to operate a HyperCube data hall, including an apportionment of total facility power. An extrapolated PUE of 1.10 is calculated as the SIN02 deployment approaches full capacity. These real-world values were recorded at the time of conducting an assessment based on Greenmark standards. The estimated values are extrapolated from this data and incorporate electrical efficiencies which will be regained with the increased loading of UPS and TX.

    Carbon (CO2) calculations: extrapolation of the direct power consumed during the relevant benchmark, including net power required at the data centre level, multiplied by the energy grid carbon coefficient present during the test.
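
One way to read the extrapolation in footnote 3 is sketched below. It assumes the usual definition of PUE (total facility energy divided by IT equipment energy), so that the data centre level energy is the measured IT energy multiplied by the PUE. The 100 kWh input is purely hypothetical, since no energy result was submitted for this benchmark; the PUE of 1.10 and grid intensity of 0.405 kg CO2-e/kWh are the values listed on this page. The function names are illustrative.

```python
def facility_energy_kwh(it_energy_kwh: float, pue: float) -> float:
    """Gross up IT energy to data centre level energy using PUE."""
    return it_energy_kwh * pue


def co2_emitted_kg(it_energy_kwh: float, pue: float, grid_kg_co2e_per_kwh: float) -> float:
    """Data centre level energy multiplied by the grid carbon intensity."""
    return facility_energy_kwh(it_energy_kwh, pue) * grid_kg_co2e_per_kwh


# Hypothetical 100 kWh of measured IT energy (not a result from this page).
hypothetical_it_kwh = 100.0
print(f"{facility_energy_kwh(hypothetical_it_kwh, 1.10):.2f} kWh")       # 110.00 kWh
print(f"{co2_emitted_kg(hypothetical_it_kwh, 1.10, 0.405):.2f} kg CO2-e")  # 44.55 kg CO2-e
```
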