GPT3 - SMC

Testing notes

Unlike the full training test of GPT3, which uses tensor parallelism and pipeline parallelism to train an LLM across multiple GPUs/servers, the fine-tuning of GPT3 uses a data-parallel technique called ZeRO, specifically ZeRO stage 3. This approach reduces the memory required on each GPU, but it increases communication volume across GPUs/servers to about 1.5 times that of standard data parallelism. The final training module has been converted to the NVIDIA NeMo framework to make full use of the H100's capabilities, increasing training performance by up to 50%.

Results

NODE LEVEL

Our verified MLPerf® Training V4.0 submission.

64 nodes / 512 GPUs

Total Energy Consumed (Joules) - AC	1,699,340,341 J
Total Energy Consumed (Joules) - DC	1,676,757,080 J
Total Energy Consumed (kWh) - DC	465.77 kWh
Total Time (decimal minutes)	56.86942
Total Time (MM:SS)	56:52

Avg Power Per Node (Network¹) - DC	0.931 kW (DC)
Avg Power Per Node (Node²) - DC	6.747 kW (DC)
Avg Power Per Node (Combined) - DC	7.678 kW (DC)
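The relationship between the figures in this table can be checked with a short Python sketch. It assumes the decimal time is in minutes and that the combined per-node figure is the total DC energy divided by run time and node count; the function names are illustrative and not part of the submission.

```python
JOULES_PER_KWH = 3_600_000  # 1 kWh = 3,600,000 J


def joules_to_kwh(joules: float) -> float:
    """Convert energy in Joules to kilowatt-hours."""
    return joules / JOULES_PER_KWH


def minutes_to_mmss(minutes: float) -> str:
    """Format decimal minutes as MM:SS, truncating fractional seconds."""
    total_seconds = int(minutes * 60)
    return f"{total_seconds // 60}:{total_seconds % 60:02d}"


def average_power_kw(joules: float, minutes: float, nodes: int) -> float:
    """Average power per node in kW (1 W = 1 J/s)."""
    return joules / (minutes * 60) / nodes / 1000


# Values taken from the node-level table above.
dc_joules = 1_676_757_080
run_minutes = 56.86942
nodes = 64

print(round(joules_to_kwh(dc_joules), 2))                         # 465.77
print(minutes_to_mmss(run_minutes))                               # 56:52
print(round(average_power_kw(dc_joules, run_minutes, nodes), 3))  # 7.678
```

Subtracting the 0.931 kW per-node network share from the combined 7.678 kW gives the 6.747 kW node figure.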

DATA CENTER LEVEL³

Power consumption at the data center level. This is not in scope within MLPerf® Training V4.0 and has not been peer-reviewed by MLCommons members.

64 nodes / 512 GPUs

Singapore 2 PUE	1.10
Extrapolated Total Energy Consumed	512.342 kWh (DC)
Net CO₂ emitted	213.54 kg
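As a rough illustration of the extrapolated energy figure, the sketch below multiplies the measured DC energy by the facility PUE. It assumes the extrapolation is a simple product of the two; the function name is hypothetical.

```python
JOULES_PER_KWH = 3_600_000  # 1 kWh = 3,600,000 J


def facility_energy_kwh(it_energy_kwh: float, pue: float) -> float:
    """Scale measured IT energy by PUE to estimate facility-level energy."""
    return it_energy_kwh * pue


# Measured DC energy from the node-level table, converted to kWh.
it_energy_kwh = 1_676_757_080 / JOULES_PER_KWH

print(round(facility_energy_kwh(it_energy_kwh, 1.10), 3))  # 512.342
```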

Footnotes: see Disclaimer & Footnotes section below.

Average Run time

The as-submitted results for MLPerf® Training V4.0 for H100 SXM systems.

Total power consumption - total job (node level)

Verified power results for MLPerf® Training V4.0 submitters. For more detail on how this was captured, refer to the Disclaimer & Footnotes section below.

Primary Information

Parameters used in the benchmark run (across all nodes)

Test Name	GPT3
Type	Large Language Model
Framework	PyTorch NVIDIA Release 24.04
Dataset	C4
Submission Date	10 May 2024
Publishing Forum	MLPerf® Training V4.0
Peer reviewed?	YES - MLCOMMONS

Hardware Information

Compute node hardware specifications.

Instance Type	H100 80GB SXM
CPU	Xeon 8462Y+, 128vCPU
Memory	2,048 GB DDR4
Network Cards (RDMA)	8 x ConnectX-7
RDMA	YES
NVLINK	900 GB/s

Environmental & DC Information

Test location & environmental conditions present at test.

Region	Singapore
Availability Zone	Singapore 2 (SIN02)
HyperCube Immersion	Yes
Energy Grid Carbon Intensity	0.405 kg CO₂-e/kWh (2022)
HyperCube design pPUE	1.02
Facility including HyperCube design PUE	1.10

Storage Cluster

External storage cluster used in benchmark run.

Type	WEKA
Disks	NVME

Network

Compute and storage network details.

Compute Fabric	InfiniBand NDR200
Contention	Min. 1,600 Gb/s uncontended
Storage Fabric	Ethernet
Contention	Peak 200 GBPS

Networking Power Allocation

Overhead allocation of power for networking to test results¹

1 Node	n/a
8 Nodes	16.10 kW
9 Nodes	16.10 kW
64 Nodes	59.57 kW

Disclaimer & Footnotes

Disclaimer & Validity - MLCommons

MLPerf® Training v4.0 Closed GPT3-VBOOST offline. Retrieved from https://mlcommons.org/benchmarks/training/ 12 June 2024. Result verified by MLCommons Association. The MLPerf® name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information.

Footnotes

¹ Network power allocation: To accurately capture the total power envelope of multi-node tests, it is appropriate to measure the power consumption of the networking equipment associated with the test. For MLPerf® Training V4.0, the members adopted a method of proportional allocation of total network power. For our submission, the following methodology was used and accepted:

SMC has a 'Scalable Unit' ('SU') of 6 nodes in 3 bays, with a pair of nodes in each bay. Each SU has 2 QM9790 64-port leaf switches, Leaf A and Leaf B. Four of the eight CX-7 NICs in each of the six nodes connect to Leaf A, and the other four connect to Leaf B. 16 ports from each leaf switch connect to 16 spine switches, Spine 01 – Spine 16.

Power consumption per switch: 1,610 W.
Port utilization for leaf switches: (16 + 4 × Num_used_nodes) / 64 ports (for a full SU: 16 upstream, 24 downstream).
Port utilization for spine switches: 2 × number of SUs (each SU has 2 leaf switches, each of which connects to 1 port of a spine switch).

Power consumption by cluster size:
8 nodes - 2 SUs; 6N in SU1, 2N in SU2 - Total interconnect power 16,100 W.
9 nodes - 2 SUs; 6N in SU1, 3N in SU2 - Total interconnect power 16,100 W.
64 nodes - 11 SUs; 6N in SU1-10, 4N in SU11 - Total interconnect power 59,570 W.
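The SU layout and the per-switch port formulas in this footnote can be sketched in Python as below. This is only an illustration of the stated formulas: the function and constant names are hypothetical, and the sketch does not attempt to reproduce the total interconnect power figures, whose full derivation is not spelled out here.

```python
NODES_PER_SU = 6             # a Scalable Unit holds 6 nodes
LEAF_PORTS = 64              # QM9790 leaf switch port count
UPLINKS_PER_LEAF = 16        # ports from each leaf to the spine switches
NICS_PER_NODE_PER_LEAF = 4   # 4 of a node's 8 CX-7 NICs go to each leaf


def su_layout(num_nodes: int) -> list[int]:
    """Fill SUs in order; return the number of nodes placed in each SU."""
    full, rest = divmod(num_nodes, NODES_PER_SU)
    return [NODES_PER_SU] * full + ([rest] if rest else [])


def leaf_port_utilization(used_nodes: int) -> float:
    """Fraction of a leaf switch's ports in use: (16 + 4 * nodes) / 64."""
    return (UPLINKS_PER_LEAF + NICS_PER_NODE_PER_LEAF * used_nodes) / LEAF_PORTS


def spine_ports_used(num_sus: int) -> int:
    """Ports used on each spine switch: 2 per SU (one per leaf)."""
    return 2 * num_sus


for nodes in (8, 9, 64):
    layout = su_layout(nodes)
    print(nodes, layout, spine_ports_used(len(layout)))
```

For example, a full SU of 6 nodes uses 40 of 64 ports on each leaf switch, a utilization of 0.625.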

For each cluster, based on its size, power has been apportioned pro-rata by the time taken to complete the benchmark test. For example, a hypothetical 8-node test that took 15 minutes to complete has its network power apportioned as follows: (15/60) h × 16,100 W = 4,025 Wh.

Measurements: 1 Watt equals 1 Joule per second. 1 kWh equals 3,600,000 Joules.
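The pro-rata apportionment described here can be sketched in Python, using the 16,100 W interconnect figure listed for an 8-node cluster. The function names are hypothetical, and treating the result as watt-hours (power multiplied by a fraction of an hour) is an interpretation of the worked example.

```python
JOULES_PER_WH = 3_600  # 1 Wh = 3,600 J, since 1 W = 1 J/s


def apportioned_network_energy_wh(interconnect_power_w: float,
                                  run_minutes: float) -> float:
    """Network energy attributed to a run: power x (run time in hours)."""
    return (run_minutes / 60) * interconnect_power_w


def wh_to_joules(wh: float) -> float:
    """Convert watt-hours to Joules."""
    return wh * JOULES_PER_WH


# Hypothetical 8-node run that takes 15 minutes.
energy_wh = apportioned_network_energy_wh(16_100, 15)
print(energy_wh)                # 4025.0
print(wh_to_joules(energy_wh))  # 14490000.0
```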

² Node power: SMC records and submits DC power via monitoring of power rectifiers that are upstream of the node. SMC nodes are immersed in HyperCube immersion tanks, arranged in 3 bays, each containing 2 H100 SXM nodes (16 GPUs). Each node is powered by a common 54V bus per bay, with each bus supporting 2 nodes. Each bus is connected to two HyperCube powershelves, equipped with 4 x 5.0 kW power rectifiers. Each rectifier reports statistics, including input/output voltage, current, power and temperature, over a CAN bus. This information is available via an HTTP REST endpoint, which is polled and logged at 1-second intervals.

For the MLPerf® Training V4.0 results, all runs for all tests were polled and written to an SQLite database. For MLCommons members, the raw data collected was submitted along with our results and verified as accurate and true. The total Joules submitted to MLCommons and displayed on this page include a verified proportion for overhead allowance of networking power, as calculated above.
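As an illustration of how logged power samples turn into an energy total, the sketch below integrates a power time series stored in SQLite. The table name, column names and sample values are all hypothetical, and trapezoidal integration is an assumption; the actual schema and aggregation used for the submission are not described here.

```python
import sqlite3


def total_energy_joules(conn: sqlite3.Connection) -> float:
    """Integrate power (W) over time (s) with the trapezoidal rule."""
    rows = conn.execute(
        "SELECT ts_seconds, output_power_w "
        "FROM rectifier_samples ORDER BY ts_seconds"
    ).fetchall()
    energy = 0.0
    # Pair each sample with the next one and add the area between them.
    for (t0, p0), (t1, p1) in zip(rows, rows[1:]):
        energy += 0.5 * (p0 + p1) * (t1 - t0)
    return energy


# Hypothetical schema and made-up 1-second samples, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rectifier_samples (ts_seconds REAL, output_power_w REAL)"
)
conn.executemany(
    "INSERT INTO rectifier_samples VALUES (?, ?)",
    [(0, 1000.0), (1, 1000.0), (2, 2000.0), (3, 2000.0)],
)

print(total_energy_joules(conn))  # 4500.0
```

Because 1 Watt equals 1 Joule per second, summing power over 1-second intervals yields Joules directly.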

In instances where Watt, Kilowatt or Kilowatt-hour are displayed, a relevant conversion from Joules and time to complete has been performed.

³ Data Center PUE & CO₂ calculations: Data center PUE calculations are not in the scope of the MLPerf® Training V4.0 results, and have not been peer-reviewed. Data relating to pPUE and PUE is presented based on our measurements, which include calculations using industry-standard methodology.

HyperCube pPUE: Partial PUE considers the load within the HyperCube and does not consider any of the supporting infrastructure to support the data center. The observed pPUE of 1.02 is representative of the system's core efficiency gains through the elimination of the fans and chilled water infrastructure to support the main HPC heat load. The Partial PUE includes cooling water pumps, fluid pumps and cooling tower fan energy.

Extrapolated PUE: This grossed-up calculation embodies total facility losses to operate a HyperCube data hall, including apportionment of total facility power. An extrapolated PUE of 1.10 is calculated as the SIN02 deployment approaches full capacity. These real-world values were recorded at the time of conducting an assessment based on Greenmark standards. The estimated values are extrapolated from this data and incorporate electrical efficiencies which will be regained with the increased loading of UPS and TX.

Carbon (CO₂) calculations: extrapolation of the direct power consumed during the relevant benchmark, including net power required at the data center level, multiplied by the energy grid carbon coefficient present during the test.

MLPerf® Training V4.0:GPT3
