MLPerf® Training V4.0:
GPT3-VBOOST

Unofficial result. Not verified by MLCommons Association.

Testing notes

This unverified result was produced under the same testing conditions and with the same container as our verified GPT3 submission on 64 H100 nodes; the only difference is that the VBOOST flag was enabled within the container. Owing to time constraints, this result was not submitted to MLCommons members. The result noted for NVIDIA is their verified MLPerf® result for this node/test combination.

Our process to capture and report the data for this unverified result was identical to all other submitted and verified results.

Results
NODE LEVEL

Our verified MLPerf® Training V4.0 submission.

64 nodes/512 gpus
Total Energy Consumed (Joules) - AC 1,625,724,000 J
Total Energy Consumed (Joules) - DC 1,603,476,000 J
Total Energy Consumed (kWh) - DC 445.41 kWh
Total Time (decimal) 54.11667
Total Time (MM:SS) 54:07
   
Ave Energy Per Node (Network¹) - DC 0.931 kW (DC)
Ave Energy Per Node (Node²) - DC 6.785 kW (DC)
Ave Energy Per Node (Combined) - DC 7.716 kW (DC)
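
The figures in this table are related by simple unit conversions (1 kWh = 3,600,000 J; average power = energy / time). The Python sketch below reproduces them from the DC energy total and the run time. The function and variable names are illustrative only and are not part of the submission tooling; the 59.57 kW networking figure is taken from the Networking Power Allocation table further down the page.

```python
# Illustrative unit conversions for the node-level table.
# All names are hypothetical; input values are copied from this page.

JOULES_PER_KWH = 3_600_000  # 1 kWh = 3,600,000 J


def joules_to_kwh(joules: float) -> float:
    """Convert energy in joules to kilowatt-hours."""
    return joules / JOULES_PER_KWH


def average_power_kw(joules: float, minutes: float, nodes: int) -> float:
    """Average power per node in kW: energy / time / node count."""
    seconds = minutes * 60
    return joules / seconds / nodes / 1000


def minutes_to_mmss(minutes: float) -> str:
    """Format decimal minutes as MM:SS."""
    total_seconds = round(minutes * 60)
    return f"{total_seconds // 60}:{total_seconds % 60:02d}"


dc_joules = 1_603_476_000
run_minutes = 54.11667
nodes = 64

energy_kwh = joules_to_kwh(dc_joules)
combined_kw = average_power_kw(dc_joules, run_minutes, nodes)
network_kw = 59.57 / nodes  # 64-node networking allocation, shared evenly
node_kw = combined_kw - network_kw

print(f"{energy_kwh:.2f} kWh")       # 445.41 kWh
print(minutes_to_mmss(run_minutes))  # 54:07
print(f"{combined_kw:.3f} kW combined, "
      f"{network_kw:.3f} kW network, {node_kw:.3f} kW node")
```

Subtracting the per-node networking share from the combined average gives the node-only average shown in the table.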
DATA CENTRE LEVEL³

Power consumption at the data centre level. This is not in scope within MLPerf® Training V4.0, and has not been peer reviewed by MLCommons members.

64 nodes/512 gpus
Singapore 2 PUE 1.10
Extrap. TTL Energy Consumed 489.951 kWh (DC)
Net CO2 emitted 204.21 kg

Footnotes: see Disclaimer & Footnotes section below.
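
The extrapolated energy figure follows from scaling the DC energy by the facility PUE. A minimal Python sketch of that step is shown here, assuming the extrapolation is a plain multiplication; the function name is hypothetical.

```python
# Illustrative data-centre-level extrapolation; names are hypothetical.

def extrapolate_facility_energy_kwh(it_energy_kwh: float, pue: float) -> float:
    """Scale IT (DC) energy by PUE to estimate total facility energy."""
    return it_energy_kwh * pue


dc_energy_kwh = 445.41  # Total Energy Consumed (kWh) - DC
facility_pue = 1.10     # Singapore 2 extrapolated PUE

facility_kwh = extrapolate_facility_energy_kwh(dc_energy_kwh, facility_pue)
print(f"{facility_kwh:.3f} kWh")  # 489.951 kWh
```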

Average Run time

The as-submitted results for MLPerf® Training V4.0 for H100 SXM systems.

Total power consumption - total job (node level)

Verified power results for MLPerf® Training V4.0 submitters. For more detail on how this was captured, refer to the sections above.

Primary Information

Parameters used in the benchmark run (across all nodes)

Test Name GPT3
Type Large Language Model
Framework PyTorch NVIDIA Release 24.04
Dataset C4
Submission Date 10 May 2024
Publishing Forum MLPerf® Training V4.0
Peer reviewed? YES - MLCOMMONS
Hardware Information

Compute node hardware specifications.

Instance Type H100 80GB SXM
CPU Xeon 8462Y+, 128vCPU
Memory 2,048 GB DDR4
Network Cards (RDMA) 8 x ConnectX-7
RDMA YES
NVLINK 900 GB/s
Environmental & DC Information

Test location & environmental conditions present at test.

Region Singapore
Availability Zone Singapore 2 (SIN02)
HyperCube Immersion Yes
Energy Grid Carbon Intensity 0.405 kg CO2-e/kWh (2022)
HyperCube design pPUE 1.02
Facility including HyperCube design PUE 1.10
Storage Cluster

External storage cluster used in benchmark run.

Type WEKA
Disks NVME
Network

Compute and storage network details.

Compute Fabric InfiniBand NDR 200
Contention Min. 1,600 Gb/s uncontended
Storage Fabric Ethernet
Contention Peak 200 Gb/s
Networking Power Allocation

Overhead allocation of power for networking to test results¹

1 Node n/a
8 Nodes 16.10 kW
9 Nodes 16.10 kW
64 Nodes 59.57 kW
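
Footnote 1 describes how this fixed interconnect power is apportioned pro-rata by the time taken to complete the test. The Python sketch below shows that arithmetic for the footnote's hypothetical 15-minute, 8-node test and for the 64-node run on this page. The function name is illustrative, and the result is expressed as energy (Wh), since power multiplied by time is energy.

```python
# Illustrative pro-rata allocation of interconnect power by run time.
# The function name is hypothetical; power values come from the table above.

def network_energy_wh(interconnect_power_w: float, run_minutes: float) -> float:
    """Energy attributed to networking: power (W) x run time (hours)."""
    return interconnect_power_w * (run_minutes / 60)


# Hypothetical 8-node test that takes 15 minutes.
print(network_energy_wh(16_100, 15))  # 4025.0

# 64-node run on this page (54.11667 minutes).
wh_64 = network_energy_wh(59_570, 54.11667)
print(f"{wh_64 / 1000:.2f} kWh, {wh_64 * 3600:,.0f} J")
```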
Disclaimer & Footnotes

Disclaimer & Validity - MLCommons

Unverified MLPerf® Training v4.0 Closed GPT3 offline. The MLPerf® name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information.

Footnotes

  1. Network power allocation: To accurately capture the total power envelope of multi-node tests, it is appropriate to measure the power consumption of the networking equipment associated with the test. For MLPerf® Training V4.0, the members adopted a method of proportional allocation of total network power. For our submission, the following methodology was used and accepted: SMC has a 'Scalable Unit' ('SU') of 6 nodes in 3 bays, with a pair of nodes in each bay. Each SU has 2 QM9790 64-port leaf switches, Leaf A and Leaf B. Four of the eight CX-7 NICs in each of the six nodes connect to Leaf A, and the other four connect to Leaf B. 16 ports from each leaf switch connect to 16 spine switches, Spine 01 – Spine 16. Power consumption per switch: 1,610 W. Port utilization for leaf switches: (16 + 4 * Num_used_nodes)/64 ports (16 upstream, up to 24 downstream). Port utilization for spine switches: 2 * Number of SUs (each SU has 2 leaf switches, each of which connects to 1 port of a spine switch).

    Power consumption by cluster size:
    8 nodes: 2 SUs (6N in SU1, 2N in SU2), total interconnect power 16,100 W.
    9 nodes: 2 SUs (6N in SU1, 3N in SU2), total interconnect power 16,100 W.
    64 nodes: 11 SUs (6N in SU1-10, 4N in SU11), total interconnect power 59,570 W.

    For each cluster size, power has been apportioned pro-rata by the time taken to complete the benchmark test. For example, a hypothetical 8 node test that took 15 minutes to complete has its power apportioned as follows: (15/60) h * 16,100 W = 4,025 Wh. Measurements: 1 Watt equals 1 Joule per second. 1 kWh equals 3,600,000 Joules.

  2. Node power: SMC records and submits DC power via monitoring of power rectifiers that are upstream of the node. SMC nodes are immersed in HyperCube immersion tanks, arranged in 3 bays, each containing 2 H100 SXM nodes (16 GPUs). Each node is powered by a common 54V bus per bay, with each bus supporting 2 nodes. Each bus is connected to two HyperCube powershelves, equipped with 4 x 5.0 kW power rectifiers. Each rectifier reports statistics, including input/output voltage, current, power and temperature over a CAN bus. This information is available via an HTTP REST endpoint, which is polled and logged at 1 second intervals.

    For the MLPerf® Training V4.0 results, all runs for all tests were polled and written to an SQLite database. For MLCommons members, the raw data collected was submitted along with our results and verified as accurate and true. The total Joules submitted to MLCommons and displayed on this page include a verified proportion for overhead allowance of networking power, as calculated above.

    In instances where Watt, Kilowatt or Kilowatt-hour are displayed, a relevant conversion from Joules and time to complete has been performed.

  3. Data Centre PUE & CO2 calculations: Data centre PUE calculations are not in scope of the MLPerf Training V4.0 results, and have not been peer reviewed. Data relating to pPUE and PUE is presented based on our own measurements, which include calculations using industry standard methodology. HyperCube pPUE: Partial PUE considers the load within the HyperCube and does not consider any of the supporting infrastructure to support the data centre. The observed pPUE of 1.02 is representative of the system's core efficiency gains through the elimination of the fans and chilled water infrastructure to support the main HPC heatload. The Partial PUE includes cooling water pumps, fluid pumps and cooling tower fan energy. Extrapolated PUE: This grossed-up calculation embodies total facility losses to operate a HyperCube data hall, including an apportionment of total facility power. An extrapolated PUE of 1.10 is calculated as the SIN02 deployment approaches full capacity. These real-world values were recorded at the time of conducting an assessment based on Greenmark standards. The estimated values are extrapolated from this data and incorporate electrical efficiencies which will be regained with the increased loading of UPS and TX.

    Carbon (CO2) calculations: extrapolation of the direct power consumed during the relevant benchmark, including net power required at the data center level, multiplied by the energy grid carbon coefficient present during the test.
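
Footnote 2 describes logging rectifier power at 1-second intervals to an SQLite database. Because 1 Watt equals 1 Joule per second, summing those samples gives energy in Joules. The Python sketch below illustrates the idea only: the table name, column names, rectifier identifiers and sample values are all hypothetical, and the actual REST endpoint and database schema are not described on this page.

```python
# Minimal sketch: log 1-second rectifier power samples to SQLite and
# integrate them into joules. The table name, columns, and sample values
# are hypothetical; the real endpoint and schema are not documented here.
import sqlite3

POLL_INTERVAL_S = 1  # samples are logged at 1-second intervals


def store_samples(conn, samples):
    """Insert (timestamp_s, rectifier_id, output_power_w) rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS rectifier_power ("
        "ts INTEGER, rectifier_id TEXT, power_w REAL)"
    )
    conn.executemany("INSERT INTO rectifier_power VALUES (?, ?, ?)", samples)
    conn.commit()


def total_energy_joules(conn):
    """Sum power x interval over all samples (1 W for 1 s = 1 J)."""
    (total_w,) = conn.execute(
        "SELECT COALESCE(SUM(power_w), 0) FROM rectifier_power"
    ).fetchone()
    return total_w * POLL_INTERVAL_S


conn = sqlite3.connect(":memory:")
store_samples(conn, [
    (0, "bay1-r1", 3400.0),
    (1, "bay1-r1", 3500.0),
    (0, "bay1-r2", 3300.0),
    (1, "bay1-r2", 3450.0),
])
joules = total_energy_joules(conn)
print(joules, "J =", joules / 3_600_000, "kWh")
```

Treating each sample as constant over its 1-second interval is a simple rectangle-rule approximation, which is reasonable when the polling interval is short relative to changes in load.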