MLPerf® Training V4.0:
GPT3
Result verified by MLCommons Association.
Unlike the full GPT-3 training test, which uses tensor parallelism and pipeline parallelism to train an LLM across multiple GPUs and servers, the fine-tuning of GPT-3 uses a newer data-parallel technique called ZeRO, specifically ZeRO stage 3. This approach reduces the memory required on each GPU, but it increases communication volume between GPUs and servers to roughly 1.5 times that of standard data parallelism. The final training module has been converted to the NVIDIA NeMo framework to make full use of the H100's capabilities, increasing training performance by up to 50%.
Our verified MLPerf® Training V4.0 submission.
Total Energy Consumed (Joules) - AC | 1,699,340,341 J |
---|---|
Total Energy Consumed (Joules) - DC | 1,676,757,080 J |
Total Energy Consumed (kWh) - DC | 465.77 kWh |
Total Time (minutes, decimal) | 56.86942 |
Total Time (MM:SS) | 56:52 |
Avg. Power Per Node (Network¹) - DC | 0.931 kW (DC) |
Avg. Power Per Node (Node²) - DC | 6.747 kW (DC) |
Avg. Power Per Node (Combined) - DC | 7.678 kW (DC) |
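The rows of this table are related by simple unit conversions (1 W = 1 J/s, 1 kWh = 3,600,000 J). As a rough cross-check, the sketch below converts the DC energy total to kWh, derives the combined per-node figure, and formats the run time. The node count of 64 is an assumption inferred from the 64-node row of the network overhead table further down, not a value stated in this table, and the function names are illustrative only.

```python
JOULES_PER_KWH = 3_600_000  # 1 kWh = 3,600,000 J


def joules_to_kwh(joules: float) -> float:
    """Convert an energy total in joules to kilowatt-hours."""
    return joules / JOULES_PER_KWH


def average_power_kw(joules: float, minutes: float, nodes: int = 1) -> float:
    """Average power per node in kW: energy / time / node count."""
    seconds = minutes * 60
    return joules / seconds / nodes / 1000


def minutes_to_mmss(minutes: float) -> str:
    """Format decimal minutes as MM:SS, truncating fractional seconds."""
    total_seconds = int(minutes * 60)
    return f"{total_seconds // 60}:{total_seconds % 60:02d}"


dc_energy_j = 1_676_757_080   # Total Energy Consumed (Joules) - DC
run_minutes = 56.86942        # Total Time (decimal)
assumed_nodes = 64            # assumption: not stated in this table

print(round(joules_to_kwh(dc_energy_j), 2))
print(round(average_power_kw(dc_energy_j, run_minutes, assumed_nodes), 3))
print(minutes_to_mmss(run_minutes))
```

Under the 64-node assumption, the three printed values line up with the 465.77 kWh, 7.678 kW and 56:52 rows above.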
Power consumption at the data centre level. This is not in scope within MLPerf® Training V4.0, and has not been peer reviewed by MLCommons members.
Singapore 2 PUE | 1.10 |
---|---|
Extrap. TTL Energy Consumed | 512.342 kWh (DC) |
Net CO2 emitted | 213.54 kg |
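Because PUE is the ratio of total facility energy to IT energy, the extrapolated total follows from multiplying the measured DC energy by the PUE. The sketch below illustrates that one step only, using the DC joule total reported earlier on this page; the function name is hypothetical, and the CO₂ figure is deliberately left out because its exact inputs are not spelled out here.

```python
def facility_energy_kwh(it_energy_kwh: float, pue: float) -> float:
    """PUE = facility energy / IT energy, so facility energy = IT energy * PUE."""
    return it_energy_kwh * pue


# Total Energy Consumed (Joules) - DC, converted to kWh (1 kWh = 3,600,000 J)
dc_energy_kwh = 1_676_757_080 / 3_600_000

extrapolated_kwh = facility_energy_kwh(dc_energy_kwh, pue=1.10)
print(round(extrapolated_kwh, 3))
```

Using the unrounded joule total rather than the rounded 465.77 kWh keeps the result in line with the 512.342 kWh row.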
Footnotes: see Disclaimer & Footnotes section below.
The as-submitted results for MLPerf® Training V4.0 for H100 SXM systems.
Verified power results for MLPerf® Training V4.0 submitters. For more detail on how this was captured, refer to the sections above.
Parameters used in the benchmark run (across all nodes)
Test Name | GPT3 |
---|---|
Type | Large Language Model |
Framework | PyTorch NVIDIA Release 24.04 |
Dataset | C4 |
Submission Date | 10 May 2024 |
Publishing Forum | MLPerf® Training V4.0 |
Peer reviewed? | YES - MLCOMMONS |
Compute node hardware specifications.
Instance Type | H100 80GB SXM |
---|---|
CPU | Xeon 8462Y+, 128 vCPU |
Memory | 2,048 GB DDR4 |
Network Cards (RDMA) | 8 x ConnectX-7 |
RDMA | YES |
NVLink | 900 GB/s |
Test location & environmental conditions present at test.
Region | Singapore |
---|---|
Availability Zone | Singapore 2 (SIN02) |
HyperCube Immersion | Yes |
Energy Grid Carbon Intensity | 0.405 kg CO₂-e/kWh (2022) |
HyperCube design pPUE | 1.02 |
Facility including HyperCube design PUE | 1.10 |
External storage cluster used in benchmark run.
Type | WEKA |
---|---|
Disks | NVME |
Compute and storage network details.
Compute Fabric | InfiniBand NDR 200 |
---|---|
Contention | Min. 1,600 Gb/s uncontended |
Storage Fabric | Ethernet |
Contention | Peak 200 Gb/s |
Overhead allocation of power for networking to test results¹
1 Node | n/a |
---|---|
8 Nodes | 16.10 kW |
9 Nodes | 16.10 kW |
64 Nodes | 59.57 kW |
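The footnotes describe apportioning this interconnect power pro rata by run time. A minimal sketch of that arithmetic follows, using the footnote's hypothetical 8-node, 15-minute example (16,100 W for a quarter of an hour, i.e. 4,025 Wh) and the 64-node overhead from this table. The function names are illustrative, and the even per-node split is an assumption rather than a documented part of the accepted method.

```python
def network_energy_joules(interconnect_power_w: float, run_minutes: float) -> float:
    """Energy attributed to networking: power (W = J/s) times run time in seconds."""
    return interconnect_power_w * run_minutes * 60


def network_power_per_node_kw(interconnect_power_w: float, nodes: int) -> float:
    """Interconnect overhead split evenly across nodes (assumed), in kW."""
    return interconnect_power_w / nodes / 1000


# Hypothetical 8-node test that took 15 minutes
energy_j = network_energy_joules(16_100, 15)
print(energy_j, "J =", energy_j / 3600, "Wh")

# 64-node overhead from the table, shared across 64 nodes
print(round(network_power_per_node_kw(59_570, 64), 3), "kW per node")
```

The per-node value for 64 nodes comes out at about 0.931 kW, consistent with the network share reported in the power results.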
Disclaimer & Validity - MLCommons
MLPerf® Training v4.0 Closed GPT3-VBOOST offline. Retrieved from https://mlcommons.org/benchmarks/training/ 12 June 2024. Result verified by MLCommons Association. The MLPerf® name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information.
Footnotes
- Network power allocation: To accurately capture the total power envelope of multi-node tests, it is appropriate to measure the power consumption of the networking equipment associated with the test. For MLPerf® Training V4.0, the members adopted a method of proportional allocation of total network power. For our submission, the following methodology was used and accepted:
  - SMC has a 'Scalable Unit' ('SU') of 6 nodes in 3 bays, with a pair of nodes in each bay.
  - Each SU consists of 2 QM9790 64-port Leaf switches, Leaf A and Leaf B. Four of the eight CX-7 NICs in each of the six nodes connect to Leaf A, and the other four CX-7s connect to Leaf B.
  - 16 ports from each Leaf switch connect to 16 Spine switches, Spine 01 to Spine 16.
  - Power consumption per switch: 1,610 W.
  - Port utilization for leaf switches: (16 + 4 * Num_used_nodes) / 64 ports (16 upstream, up to 24 downstream).
  - Port utilization for spine switches: 2 * Number of SUs (each SU has 2 leaf switches, each of which connects to 1 port of a spine switch).
Power consumption by cluster size:
  - 8 nodes: 2 SUs (6 nodes in SU1, 2 nodes in SU2). Total interconnect power 16,100 W.
  - 9 nodes: 2 SUs (6 nodes in SU1, 3 nodes in SU2). Total interconnect power 16,100 W.
  - 64 nodes: 11 SUs (6 nodes in each of SU1-10, 4 nodes in SU11). Total interconnect power 59,570 W.
For each cluster size, network power has been apportioned pro rata by the time taken to complete the benchmark test. For example, a hypothetical 8-node test that took 15 minutes to complete has its network overhead apportioned as follows: (15/60) * 16,100 = 4,025 Wh. Measurements: 1 Watt equals 1 Joule per second. 1 kWh equals 3,600,000 Joules.
- Node power: SMC records and submits DC power by monitoring the power rectifiers upstream of the node. SMC nodes are immersed in HyperCube immersion tanks, arranged in 3 bays, each containing 2 H100 SXM nodes (16 GPUs). Each node is powered by a common 54 V bus per bay, with each bus supporting 2 nodes. Each bus is connected to two HyperCube powershelves, equipped with 4 x 5.0 kW power rectifiers. Each rectifier reports statistics, including input/output voltage, current, power and temperature, over a CAN bus. This information is available via an HTTP REST endpoint, which is polled and logged at 1-second intervals.
For the MLPerf® Training V4.0 results, all runs for all tests were polled and written to an SQLite database. For MLCommons members, the raw data collected was submitted along with our results and verified as accurate and true. The total Joules submitted to MLCommons and displayed on this page include a verified proportion for overhead allowance of networking power, as calculated above.
In instances where Watt, Kilowatt or Kilowatt-hour are displayed, a relevant conversion from Joules and time to complete has been performed.
- Data Centre PUE & CO₂ calculations: Data centre PUE calculations are not in scope of the MLPerf® Training V4.0 results, and have not been peer reviewed. Data relating to pPUE and PUE is presented based on our own measurements, which include calculations using industry-standard methodology.
  - HyperCube pPUE: Partial PUE considers the load within the HyperCube and does not consider any of the supporting infrastructure of the data centre. The observed pPUE of 1.02 is representative of the system's core efficiency gains through the elimination of the fans and chilled water infrastructure that would otherwise support the main HPC heat load. The partial PUE includes cooling water pumps, fluid pumps and cooling tower fan energy.
  - Extrapolated PUE: This grossed-up calculation embodies the total facility losses to operate a HyperCube data hall, including an apportionment of total facility power. An extrapolated PUE of 1.10 is calculated as the SIN02 deployment approaches full capacity. These real-world values were recorded at the time of conducting an assessment based on Greenmark standards. The estimated values are extrapolated from this data and incorporate electrical efficiencies which will be regained with the increased loading of UPS and TX.
  - Carbon (CO₂) calculations: Extrapolation of the direct power consumed during the relevant benchmark, including net power required at the data centre level, multiplied by the energy grid carbon coefficient present during the test.
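To make the Scalable Unit layout and port-utilization formulas from the network power allocation footnote concrete, here is a small sketch. It only evaluates the stated formulas (6 nodes per SU, 64-port leaf switches, 16 uplinks per leaf, 4 NICs per node per leaf); it does not attempt to reproduce the published interconnect totals, which depend on details of the accepted allocation method that are not spelled out on this page. All names are hypothetical.

```python
NODES_PER_SU = 6            # nodes in one Scalable Unit
LEAF_PORTS = 64             # ports on a QM9790 leaf switch
UPLINKS_PER_LEAF = 16       # leaf ports going up to the spine switches
NICS_PER_NODE_PER_LEAF = 4  # CX-7 NICs from each node to each leaf


def su_layout(nodes: int) -> list[int]:
    """Fill SUs in order: full SUs of 6 nodes, then one partial SU if needed."""
    full, rest = divmod(nodes, NODES_PER_SU)
    return [NODES_PER_SU] * full + ([rest] if rest else [])


def leaf_port_utilization(used_nodes: int) -> float:
    """Fraction of leaf ports in use: (16 + 4 * Num_used_nodes) / 64."""
    return (UPLINKS_PER_LEAF + NICS_PER_NODE_PER_LEAF * used_nodes) / LEAF_PORTS


def spine_ports_used(num_sus: int) -> int:
    """Each SU has 2 leaf switches, each using 1 port on every spine switch."""
    return 2 * num_sus


for cluster_size in (8, 9, 64):
    layout = su_layout(cluster_size)
    print(cluster_size, "nodes:", len(layout), "SUs,",
          "leaf utilization per SU:", [leaf_port_utilization(n) for n in layout],
          "spine ports used per spine:", spine_ports_used(len(layout)))
```

The layouts this produces (2 SUs for 8 and 9 nodes, 11 SUs for 64 nodes) match the cluster breakdowns listed in the footnote.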