GPU Cluster Networking 101
Training a large model requires splitting the work across multiple servers. After every training step, each GPU computes how the model should change based on its chunk of data, then shares those updates with every other GPU so they all work from the same model on the next step.
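This sharing step is a collective operation called all-reduce: afterwards, every GPU holds the element-wise sum of all GPUs' updates. As an illustration, here is a minimal pure-Python simulation of the ring schedule commonly used by collective libraries, with plain lists standing in for GPU buffers (a sketch of the algorithm, not an implementation of any real library):

```python
def ring_all_reduce(grads):
    """Simulate ring all-reduce: every GPU ends with the element-wise sum
    of all GPUs' gradients. grads[g] is GPU g's gradient vector; vector
    length must be divisible by the GPU count."""
    n = len(grads)
    chunk = len(grads[0]) // n
    bufs = [list(g) for g in grads]          # each GPU's working buffer

    def span(c):                             # index range of chunk c
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. At step s, GPU g sends chunk (g - s) % n to
    # its ring neighbor, which accumulates it. After n-1 steps each GPU
    # holds one fully summed chunk.
    for s in range(n - 1):
        for g in range(n):
            c, dst = (g - s) % n, (g + 1) % n
            for i in span(c):
                bufs[dst][i] += bufs[g][i]

    # Phase 2: all-gather. Each GPU forwards the finished chunk it holds
    # around the ring until every GPU has every chunk.
    for s in range(n - 1):
        for g in range(n):
            c, dst = (g + 1 - s) % n, (g + 1) % n
            for i in span(c):
                bufs[dst][i] = bufs[g][i]
    return bufs
```

The ring shape matters for networking: each GPU only ever talks to its two neighbors, so the traffic pattern is predictable and bandwidth-bound rather than latency-bound.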
The data travels through a chain of hardware: out of GPU memory, through a NIC, over a cable, through switches, and back down to the destination GPU. If any link in that chain is slow, GPUs sit idle waiting. A cluster usually runs multiple networks: one for GPU-to-GPU training traffic, one for storage, and one for management. Networking accounts for roughly 20% of total cluster cost. [1]
Topology
Running a direct cable from every server to every other is not practical: the cable count grows quadratically with server count. Instead, servers plug into switches: shared boxes that receive data on one port and forward it out the correct port toward its destination.
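The quadratic growth is easy to quantify: a full mesh needs one cable per server pair.

```python
def full_mesh_cables(servers):
    """Direct cabling: one cable per pair of servers, n*(n-1)/2 total."""
    return servers * (servers - 1) // 2
```

At the 3,072-server scale discussed below, `full_mesh_cables(3072)` works out to 4,717,056 cables, which is why nobody wires clusters this way.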
A single switch runs out of ports quickly as the number of servers grows, so the natural next step is to group servers onto several switches, then connect those switches with another layer of switches on top. This is called leaf-spine, a type of fat-tree topology. The bottom switches (leaves) connect directly to servers. The top switches (spines) connect the leaves to each other. Data goes up from leaf to spine, then back down to the destination leaf. Multiple spines provide multiple paths, so traffic spreads out instead of funneling through one point.
NVIDIA DGX and HGX reference designs use a rail-optimized wiring pattern on top of leaf-spine. Each of the 8 GPUs in a server gets its own "rail": GPU 0 from every server connects to the same leaf switch (Rail 0), GPU 1 to another leaf (Rail 1), and so on. Most GPU-to-GPU traffic stays within a single rail, which reduces the load on spine switches.
Miswiring a rail (connecting a GPU's NIC to the wrong leaf switch) breaks this locality and forces traffic through extra spine hops. Meta reported that network switch and cable faults caused 8% of unexpected job interruptions during Llama 3 training on 16,384 H100s. [6]
Each leaf switch reserves about half its ports for uplinks to spines. A 64-port leaf only has about 32 ports left for servers. So you need more leaves than the port count suggests, plus the spines on top of that.
What that looks like in practice:
- 16 GPUs (2 servers): A single 64-port switch could handle 2 servers, but reference architectures deploy 2 leaf + 1 spine = 3 switches.
- 192 GPUs (24 servers): 6 switches in two layers.
- 576 GPUs (72 servers): 12 switches, still two layers.
- 24,576 GPUs (3,072 servers): Three layers. The second layer of spines runs out of ports, so a third layer of superspine switches connects groups of spines. 768 leaf + 768 spine + 384 superspine = 1,920 switches.
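The layer counts follow from the port math. Below is a simplified sketch that assumes every non-top layer splits its radix evenly between downlinks and uplinks; real reference architectures round to rail and rack boundaries, so small-cluster counts differ from this ideal:

```python
import math

def fat_tree_layers(endpoints, radix=64):
    """Switches per layer of an idealized non-blocking fat tree.
    endpoints: NIC links entering the fabric (servers x NICs per server).
    Every layer except the top uses half its ports for uplinks; the top
    layer faces all ports down."""
    half = radix // 2
    layers = 1
    while 2 * half ** layers < endpoints:   # max endpoints of an n-layer tree
        layers += 1
    counts = [math.ceil(endpoints / half)] * (layers - 1)
    counts.append(math.ceil(endpoints / radix))
    return counts
```

For the 24,576-GPU cluster above, `fat_tree_layers(3072 * 8)` gives `[768, 768, 384]`: 1,920 switches in total, matching the deployment figures.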
When every layer has enough switches that all ports can run at full speed simultaneously, the result is called a full fat-tree. A cheaper option is a reduced fat-tree with fewer spines, but then some traffic has to wait. Training clusters use full fat-tree because all GPUs exchange data at the same time. Other topologies exist (3D torus, dragonfly) but fat-tree dominates GPU training.
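The difference between full and reduced fat-trees is usually expressed as an oversubscription ratio: server-facing ports divided by uplink ports on a leaf. A trivial sketch:

```python
def oversubscription(downlink_ports, uplink_ports):
    """Leaf oversubscription ratio; 1.0 means non-blocking (full fat-tree),
    anything above 1.0 means traffic can queue at the uplinks."""
    return downlink_ports / uplink_ports
```

A 64-port leaf split 32/32 gives `oversubscription(32, 32) == 1.0` (full fat-tree); a 48/16 split gives a 3:1 reduced fat-tree, common in storage and management networks but avoided in training fabrics.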
Switches
Data center switches run custom ASICs (application-specific integrated circuits) that forward packets in silicon. A switch's capacity is its port count multiplied by per-port speed.
Training fabrics need non-blocking switches, where every port runs at full speed simultaneously. Management and storage networks can tolerate blocking (oversubscribed) switches.
Switch naming can be confusing because there are three layers: the chip maker, the ASIC family, and the switch model. NVIDIA makes two ASIC families: Quantum for InfiniBand and Spectrum for Ethernet. Broadcom makes Tomahawk, an Ethernet ASIC that other companies (Arista, Cisco, Edgecore) build into their own switch models. So when you see "Arista 7060X6", that is an Arista switch running a Broadcom Tomahawk 5 chip inside.
InfiniBand switches come in generations named by speed: NDR (next data rate, 400 Gb/s per port) and XDR (eXtreme data rate, 800 Gb/s per port). Ethernet switches use the same speed tiers but are identified by their ASIC rather than a generation name.
| ASIC family | Protocol | Model | Ports | Per-port speed | Role |
|---|---|---|---|---|---|
| NVIDIA Quantum | InfiniBand | QM9700 (NDR) | 64 | 400 Gb/s | Training fabric |
| NVIDIA Quantum | InfiniBand | Q3400 (XDR) | 144 (72 twin-port) | 800 Gb/s | Training fabric (next-gen) |
| NVIDIA Spectrum | Ethernet | SN5600 / SN5610 | 64 | 400-800 Gb/s | Ethernet fabric, RoCE |
| Broadcom Tomahawk 5 | Ethernet | Arista 7060X6, others | 64 | 800 Gb/s | Ethernet fabric, RoCE |
InfiniBand clusters also require UFM (Unified Fabric Manager), NVIDIA's software for routing, health monitoring, and diagnostics across IB switches. It runs on a dedicated server and costs $144-172 per node for a three-year license.
Everything from GPU to switch
The training framework (PyTorch, DeepSpeed) runs on the CPU and calls NCCL (NVIDIA Collective Communications Library), which figures out which data goes where and triggers the transfer through the hardware chain below.
NVLink
The 8 GPUs inside a single server do not use the external network to talk to each other. They communicate through NVLink, NVIDIA's direct GPU-to-GPU interconnect. Bandwidth is 900 GB/s per GPU on Hopper (H100, H200) and 1.8 TB/s on Blackwell (B200, B300).
NVSwitch
NVSwitch is the chip on the server baseboard that connects all 8 GPUs via NVLink. It acts as an on-server crossbar switch so any GPU can send data to any other GPU at full NVLink bandwidth without contention.
NIC
To reach GPUs on other servers, each GPU has a dedicated NIC (network interface card) connected to it over PCIe on the baseboard. The NIC packages data into network packets and hands them to a transceiver, which converts the electrical signal to light for the cable. GPU clusters use NVIDIA's ConnectX NICs. An 8-GPU server uses 8 NICs and 8 switch ports just for training traffic. B200 servers typically ship with ConnectX-7 at 400 Gb/s (gigabits per second) per NIC, B300 servers with ConnectX-8 at 800 Gb/s.
Network speeds are written as Gb/s or shortened to just "G", so a "400G NIC" means 400 gigabits per second. GbE (gigabit Ethernet) is the same idea for Ethernet specifically: 1 GbE = 1 gigabit per second Ethernet.
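The bits-versus-bytes distinction matters when estimating transfer times. A small sketch (the 10 GB buffer below is a made-up example, and protocol overhead is ignored):

```python
def transfer_seconds(size_gigabytes, link_gbps):
    """Ideal time to push a buffer through one link. Link speed is quoted
    in gigaBITS per second, so divide by 8 to get gigabytes per second."""
    return size_gigabytes / (link_gbps / 8)
```

A 400G NIC moves at most 50 GB/s, so a 10 GB buffer takes at least 0.2 seconds on one link; this is the kind of back-of-envelope math used to judge whether a fabric can keep GPUs busy.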
With standard TCP/IP networking, data would travel GPU → system memory → CPU → NIC, and the CPU would process every packet. That adds latency and eats into GPU compute time. GPU clusters use RDMA (Remote Direct Memory Access) instead. With GPUDirect RDMA, the CPU (via NCCL) tells the NIC which GPU memory addresses to read and where to send them. The NIC then reads directly from GPU memory over PCIe, bypassing both the CPU and system memory. On the receiving side, the NIC writes straight into the destination GPU's memory. The CPU kicks off the transfer but never touches the data. RDMA is extremely sensitive to packet loss — even tiny loss rates can degrade training throughput, which is why training fabrics are engineered to be lossless.
Transceivers and cables
The link between a NIC and a switch starts with a transceiver, an optical module that converts the NIC's electrical signal to light. A fiber cable connects one transceiver to another.
Transceivers are expensive. A single 400G transceiver runs $1,000-2,000, and newer 800G twin-port modules cost several times more. Every link needs a transceiver at each end, so the cost adds up fast. In the clusters we have reviewed, optics and cabling regularly cost more than the switches themselves.
Fiber comes in two types: single-mode fiber (SMF) for long runs up to 500 meters, and multi-mode fiber (MMF) for shorter runs around 100 meters at 400G. For runs under 3-5 meters within the same rack, DAC (Direct Attach Copper) cables skip transceivers entirely, saving cost and power.
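The choice between DAC, MMF, and SMF is mostly a function of run length. A sketch of that decision using the rough cutoffs above (these are approximations from this article, not vendor reach specs):

```python
def pick_cable(run_meters):
    """Rough media choice for a 400G link by run length."""
    if run_meters <= 5:        # in-rack: copper, no transceivers needed
        return "DAC"
    if run_meters <= 100:      # row scale: multi-mode fiber + optics
        return "MMF"
    if run_meters <= 500:      # hall scale: single-mode fiber + optics
        return "SMF"
    raise ValueError("beyond typical in-building reach at 400G")
```

Because DAC links need no transceivers at either end, dense in-rack topologies that keep links short save thousands of dollars per link.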
Three networks in a cluster
GPU-to-GPU training traffic gets its own dedicated fabric. A cluster runs at least two other networks alongside it for storage and management. InfiniBand clusters use physically separate switches for each, since IB switches cannot carry Ethernet traffic. RoCE clusters can share Ethernet switches across storage and training, bringing that down to two networks.
The management network connects to BMC interfaces (baseboard management controller), the chip on every server that lets you power-cycle, read sensors, and access a remote console without touching the OS. BMC ports are 1 GbE, so this network runs on 1G switches. A 576-GPU cluster we reviewed had 8 management switches for 72 servers.
The storage network carries training datasets and checkpoints between GPUs and network-attached storage at 100-800 Gb/s using NVMe-over-Fabrics (NVMe-oF). Storage servers connect through ConnectX-6 Dx NICs at 100 GbE, and CPU servers plug in through BlueField DPUs.
A firewall sits at the boundary between the cluster's internal networks and the outside network.
| Network | Speed | Traffic | Typical switches | Ports per server |
|---|---|---|---|---|
| Management | 1 Gb/s | BMC, SSH, monitoring | Arista 7010TX, NVIDIA SN2201 | 1 |
| Ethernet fabric | 100-800 Gb/s | Storage, control plane | Tomahawk 5, Spectrum SN5600/SN4700 | 1-2 |
| Training fabric | 400-800 Gb/s | GPU-to-GPU training | QM9700, Q3400 (IB) or Spectrum/Tomahawk (RoCE) | 8 |
InfiniBand or Ethernet
The training fabric speaks one of two protocols: InfiniBand or Ethernet with RoCE (RDMA over Converged Ethernet). That choice determines which switches, cables, and management software you buy.
InfiniBand is purpose-built for high-performance computing. It guarantees lossless delivery: a sender does not transmit until the receiver confirms it has buffer space, so packets are not dropped due to congestion. NVIDIA controls the full InfiniBand stack (ConnectX NICs, Quantum switches, management software), which means the hardware works together out of the box with minimal setup.
Ethernet with RoCE adds RDMA support to standard Ethernet. Lossless behavior requires configuring PFC (Priority Flow Control) and ECN (Explicit Congestion Notification), tied together by the DCQCN congestion control algorithm. [2] Tools like SONiC and Cumulus Linux now automate much of this configuration, and Meta has demonstrated RoCE at 24,000-GPU scale. [7] The gap in operational difficulty has narrowed, but InfiniBand still requires less network engineering effort. ConnectX NICs support both protocols: the same card runs InfiniBand or Ethernet depending on which switch it plugs into.
As of Q3 2025, Ethernet accounted for more than two-thirds of AI back-end network switch sales, up from less than half of sales a year earlier. [3] Hyperscalers like Meta choose Ethernet for the lower cost per port and multi-vendor switch options. Most dedicated training clusters at smaller scale that we have reviewed still use InfiniBand, where the simpler setup outweighs the per-port premium.
Two efforts are working to close the remaining gap. The Ultra Ethernet Consortium (UEC) is building an open Ethernet stack designed for AI and HPC traffic patterns. [4] NVIDIA's Spectrum-X bundles Spectrum switches with ConnectX SuperNICs, claiming 1.6x higher AI network performance over standard Ethernet.
| InfiniBand | Ethernet (RoCE) | |
|---|---|---|
| Lossless behavior | Built in (credit-based flow control) | Requires PFC/ECN configuration |
| Switch vendors | NVIDIA only | Arista, Broadcom, NVIDIA, others |
| Switch examples | QM9700 (NDR), Q3400 (XDR) | SN5600, Tomahawk 5-based |
| NIC | ConnectX (same card for both) | ConnectX (same card for both) |
| Network management | UFM (required) | SONiC, Cumulus, vendor NOS |
| Setup effort | Low (works out of the box) | Moderate (improving with automation) |
| Latency (end-to-end) | <2 µs [5] | 2-5 µs (well-tuned RoCE) |
| Cost per port at scale | Higher | Lower |
| Run-to-run consistency | High (predictable tail latency) | Varies with tuning quality |
What networking costs at different scales
Networking accounts for roughly 20% of total cluster cost across the BOMs we have reviewed. [1] The table below shows the breakdown for a 128-GPU (16-server) example; quantities and costs scale with GPU count.
Example: 128 GPUs (16 servers), with 4 leaf + 2 spine IB switches and 256 IB links.
| Component | Qty | Est. cost |
|---|---|---|
| IB switches (QM9700, NDR 400G) | 6 | $186K |
| Transceivers (IB fabric) | 512 | $563K |
| Fiber / DAC cables (IB fabric) | 256 | $31K |
| Ethernet switches | 4 | $60K |
| Management switches (1G) | 1 | $5K |
| Training NICs (ConnectX) | 128 | $154K |
| UFM servers + licenses | 1 | $21K |
| Firewall | 1 | $25K |
| Estimated networking total | | $1.0M |
Actual costs vary significantly by vendor and volume. Assumes InfiniBand training fabric with full fat-tree topology.
Two patterns stand out as cluster size grows. First, transceivers grow faster than switches. Every IB link needs a transceiver at each end, and a full fat-tree doubles the link count (server-to-leaf plus leaf-to-spine). Second, the switch generation matters: a Q3400 packs 144 ports versus 64 on a QM9700, so larger clusters need fewer physical switches per GPU.
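Those link and transceiver counts can be reproduced in a few lines. A sketch for a two-layer full fat-tree, matching the 16-server table above:

```python
def fabric_optics(servers, nics_per_server=8, link_tiers=2):
    """Training-fabric link and transceiver counts for a full fat-tree.
    Each tier (server-to-leaf, leaf-to-spine) carries one link per NIC;
    every link needs a transceiver at both ends."""
    links = servers * nics_per_server * link_tiers
    return links, 2 * links
```

`fabric_optics(16)` returns `(256, 512)`: the 256 IB links and 512 transceivers in the table. Adding a third tier at larger scale raises `link_tiers` to 3, which is why optics dominate the BOM as clusters grow.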
References
- [1] Cost percentages based on BOMs we have reviewed across clusters ranging from 16 to 24,576 GPUs. See our AI Cluster Cost Breakdown for detailed analysis. /blog/ai-cluster-cost-breakdown-capex
- [2] Zhu et al., “Congestion Control for Large-Scale RDMA Deployments,” ACM SIGCOMM 2015. https://dl.acm.org/doi/10.1145/2785956.2787484
- [3] “AI Back-End Networks Continue Their Shift to Ethernet,” Dell’Oro Group (December 2025). https://www.delloro.com/news/ai-back-end-networks-continue-their-shift-to-ethernet-now-accounting-for-over-two-thirds-of-3q-2025-switch-sales-in-ai-clusters/
- [4] Ultra Ethernet Consortium, a Linux Foundation project for Ethernet-based AI/HPC networking (founded 2023). https://ultraethernet.org/
- [5] NVIDIA ConnectX-7 InfiniBand/VPI Adapter Card Datasheet (sub-1.3 µs InfiniBand end-to-end latency). https://www.nvidia.com/en-us/networking/infiniband/connectx-7/
- [6] Dubey et al., “The Llama 3 Herd of Models,” Meta (2024), Section 3.3.4 on training reliability. https://arxiv.org/abs/2407.21783
- [7] “RoCE networks for distributed AI training at scale,” Meta Engineering (August 2024). https://engineering.fb.com/2024/08/05/data-center-engineering/roce-network-distributed-ai-training-at-scale/
Frequently Asked Questions
What percentage of AI cluster cost goes to networking?
Networking accounts for roughly 20% of total cluster cost. Transceivers grow faster than switches as cluster size increases because every link needs a transceiver at each end, and a full fat-tree doubles the link count (server-to-leaf plus leaf-to-spine). In the clusters we have reviewed, optics and cabling regularly cost more than the switches themselves.
What is leaf-spine topology?
Leaf-spine is the standard network layout for AI clusters. Leaf switches connect directly to servers. Spine switches connect leaves to each other. Data goes up from leaf to spine, then back down to the destination leaf. Multiple spines provide multiple paths, so traffic spreads out instead of funneling through one point.
Should you use InfiniBand or Ethernet for GPU training?
Most dedicated training clusters at smaller scale use InfiniBand, where the simpler setup outweighs the per-port premium. Hyperscalers like Meta choose Ethernet for the lower cost per port and multi-vendor switch options. As of Q3 2025, Ethernet accounted for more than two-thirds of AI back-end network switch sales. ConnectX NICs support both protocols: the same card runs InfiniBand or Ethernet depending on which switch it plugs into.
How many switches does a GPU cluster need?
It depends on scale. 16 GPUs (2 servers) need 3 switches. 192 GPUs (24 servers) need 6 switches in two layers. 576 GPUs (72 servers) need 12. At 24,576 GPUs (3,072 servers), the fabric hits three layers: 768 leaf + 768 spine + 384 superspine = 1,920 switches total.