AI Cluster Cost Breakdown: CapEx (2026)

By Bernie Margulies

An AI cluster's CapEx (capital expenditure) is defined by its Bill of Materials (BOM), the complete list of hardware needed to build it. GPUs account for 60-70% of total cost. Networking runs 10-25%. Storage, power distribution, cabling, and management infrastructure fill the rest. Based on current Blackwell-generation BOMs we've reviewed in 2025-2026, a 16-GPU cluster costs roughly $1M, a 576-GPU deployment runs $36M, and a 24,576-GPU hyperscale cluster would cost roughly $1.15B.

What is a BOM

A BOM (rhymes with "Tom") is the full parts list for a system: every component, its quantity, and its unit price. It might be 10 line items or 200. Some integrators bundle everything into one number: "$36M for a 576-GPU cluster, delivered and racked." Others break it down to individual cable lengths and optical modules.

The detail level matters. If you can't read it, you can't tell whether the storage is oversized for your workload, whether the network design matches what you need, or whether per-unit pricing is competitive. A detailed BOM is also how you compare quotes from competing vendors.

What's in an AI cluster's BOM

An AI cluster's BOM breaks down into seven categories. Proportions shift with scale, GPU generation, and workload, but the categories stay consistent. The pricing and examples below are based on B200 BOMs unless noted otherwise. We also reference Meta's prior-generation H100 cluster as a hyperscale case study.

| Category | Typical % of CapEx | What it covers |
| --- | --- | --- |
| GPU servers | 60-70% | GPU boards, CPUs, RAM, fast storage drives, power supplies, chassis, cooling |
| GPU networking | 10-25% | InfiniBand or Ethernet switches, network adapters, optical transceivers, fiber cabling |
| CPU/management nodes | 2-5% | Servers that run the scheduler, monitoring, and network management |
| Storage servers | 0-15% | Dedicated storage nodes for datasets and checkpoints (workload-dependent) |
| Infrastructure | 2-5% | Racks, power distribution units (PDUs), firewalls, structured cabling |
| Cabling and optics | 3-8% | Fiber runs, copper cables, optical modules that connect cables to switches |
| Software and services | 1-3% | Network management licenses, integration, warranty |

Cluster cost by scale (B200)

| Scale | GPUs | GPU servers | Networking | Storage | Everything else | Total | Per GPU |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Single server | 8 | $350K | — | — | — | $350K | $44K |
| Small cluster | 16 | $730K | $195K | $75K | $55K | $1.1M | $66K |
| Mid-size cluster | 576 | $24.5M | $6.0M | $1.8M | $2.9M | $35.2M | $61K |
| Hyperscale (est.) | 24,576 | $835.6M | $221.2M | $98.3M | — | $1.2B | $47K |
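The per-GPU and share figures above follow directly from the category totals. A quick arithmetic sketch using the 576-GPU row (all dollar figures come from the table):

```python
# Rough CapEx arithmetic for the 576-GPU B200 cluster
# (figures in millions of USD, taken from the table above).
gpu_servers, networking, storage, everything_else = 24.5, 6.0, 1.8, 2.9

total = gpu_servers + networking + storage + everything_else
per_gpu = total * 1e6 / 576                  # dollars per GPU
gpu_share = gpu_servers / total              # GPU servers' share of CapEx

print(f"total: ${total:.1f}M")               # $35.2M
print(f"per GPU: ${per_gpu/1000:.0f}K")      # ~$61K
print(f"GPU server share: {gpu_share:.0%}")  # ~70%
```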

This ratio holds across GPU generations. In Blackwell deployments we've reviewed, GPUs take roughly two-thirds of total spend. [2] Meta's prior-generation H100 cluster showed the same pattern at 65.8%. [1]

Networking is the second biggest cost, and its share of total CapEx grows with scale. High-speed interconnects like InfiniBand and RoCE (RDMA over Converged Ethernet) connect GPUs across servers into a shared fabric so they can work together on the same training job. A 16-GPU cluster needs a few switches; NDR (400G) InfiniBand switches like the QM9700 run ~$30-35K each as of early 2026. [3] A 576-GPU cluster needs a dozen or more. A 24,576-GPU cluster needs 1,920. Every fiber link also needs optical transceivers, the small modules that plug into switch and adapter ports and convert electrical signals to light, at ~$1-1.5K per end. At Meta's scale, transceivers alone cost $88M. [1]
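To see how transceiver costs compound at scale, here is a back-of-envelope check against the Meta figures cited above. The counts come from the Pytorch to Atoms BOM; the $1,200 unit price is an assumed midpoint of the quoted $1,000-$1,300 range:

```python
# Back-of-envelope: optical transceiver spend at Meta's 24,576-GPU scale.
# Counts come from the Pytorch to Atoms BOM cited in the text; the $1,200
# unit price is an assumed midpoint of the quoted $1,000-$1,300 range.
gpus = 24_576
transceivers = 73_728       # works out to ~3 per GPU across the fabric
unit_price = 1_200          # assumed midpoint, USD

per_gpu_modules = transceivers / gpus
total = transceivers * unit_price

print(per_gpu_modules)      # 3.0 transceivers per GPU
print(f"${total/1e6:.1f}M") # ~$88.5M, in line with the figure above
```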

Networking is worth the spend because it's the cheapest way to make GPUs faster. A 5% training speed-up from better networking gear costs far less than buying enough additional GPUs to get the same improvement, and those extra GPUs also need rack space, cooling, and power. Underinvesting in networking means your GPUs spend more time waiting for data than computing.

Storage is the most variable category. A training cluster with large datasets might dedicate 15% of CapEx to dedicated storage nodes. Pricing varies widely based on capacity, CPU configuration, and NVMe drive count, but a typical all-NVMe storage server (50-200 TB raw) currently runs in the ~$35-40K range. [3] An inference cluster might use zero beyond the NVMe drives, a type of fast solid-state storage, already built into each GPU server.

Inside a GPU server

A GPU server isn't just graphics cards. Each server contains 8 GPUs on an HGX baseboard, NVIDIA's standard board that packages 8 GPUs with high-speed links between them. [5] It also has CPUs, system RAM, fast storage, network adapters, power supplies, and cooling. The GPUs do the training math. Everything else exists to keep data flowing into them fast enough that they're not sitting idle.

Inside a typical 8-GPU server at $250-400K (B200/B300 generation, as of early 2026, based on industry pricing): [3]

| Component | What it does | Typical cost |
| --- | --- | --- |
| HGX GPU board (8 GPUs) | The compute engine. 8 GPUs with high-speed direct links between them so they can share data without going through the network. | $200K-$300K+ |
| CPUs (2x) | Manage data flow between storage and GPUs, run the operating system. | $3K-$15K each |
| RAM (1-3 TB DDR5) | System memory that holds data in queue before it reaches the GPUs. | $5K-$10K |
| NVMe drives (2-10 TB) | Fast solid-state drives for the operating system, training checkpoints, and working datasets. | $3K-$15K |
| Network adapters (8x) | One network adapter per GPU, connecting the server to the rest of the cluster at 400Gbps (NDR) to 800Gbps (XDR) depending on generation. B300 systems ship with XDR (800G). | $1K-$1.5K each |
| BlueField-3 DPUs (1-2x) | Data processing units that offload networking, storage, and security tasks from the CPUs. DGX systems include 2; OEM HGX builds vary. | ~$2-6K each |
| Power supplies, chassis, cooling, rails | Redundant power supplies, server enclosure, fans or liquid cooling, rack-mount hardware. | $3K-$8K combined |
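Summing the component table gives a sanity check on the server price range. This is a sketch using the (low, high) ends of each row above; it is a parts subtotal only:

```python
# Sanity-check the 8-GPU server price by summing the component table.
# (low, high) component costs in USD, taken from the table above.
components = {
    "HGX board (8 GPUs)":    (200_000, 300_000),
    "CPUs (2x)":             (2 * 3_000, 2 * 15_000),
    "RAM (1-3 TB DDR5)":     (5_000, 10_000),
    "NVMe drives":           (3_000, 15_000),
    "Network adapters (8x)": (8 * 1_000, 8 * 1_500),
    "DPUs (1-2x)":           (1 * 2_000, 2 * 6_000),
    "Power/chassis/cooling": (3_000, 8_000),
}
low = sum(lo for lo, hi in components.values())
high = sum(hi for lo, hi in components.values())
print(f"${low/1000:.0f}K - ${high/1000:.0f}K")  # parts subtotal
```

The subtotal lands below the quoted $250-400K server price; the difference is OEM integration, support, and margin on top of the raw parts.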

Typical BOM Breakdown

Proportional CapEx by category, with each category's typical internal split:

| Category | Typical % of CapEx | Within-category split |
| --- | --- | --- |
| GPU servers | 65% | HGX GPU board 69%, OEM integration 18%, CPUs (2x) 4%, network adapters (8x) 3%, NVMe storage 3%, RAM 2%, power & chassis 2%, DPU 1% |
| Networking | 20% | InfiniBand switches 38%, optical transceivers 28%, fiber & cabling 15%, Ethernet switches 10%, network management 5%, network adapters 4% |
| Everything else | 10% | CPU/mgmt nodes 30%, racks & enclosures 20%, PDUs 15%, firewall 15%, cabling 10%, software & licenses 10% |
| Storage | 5% | NVMe drives 65%, server hardware 30%, software 5% |

The rest of the BOM

Beyond the GPU servers, every cluster needs management nodes, networking equipment, and physical infrastructure. None of these categories are individually dominant, but they add up.

Management nodes (~$15K each) run the job scheduler that assigns work to GPUs, manage the InfiniBand network, and handle monitoring and logging. [3]

InfiniBand is the high-speed network that connects GPU servers into a cluster. Without it, each server is an island. The cost comes from switches (NDR-generation QM9700s run ~$30-35K each as of early 2026 [4]), a network adapter in each server for every GPU, optical transceivers on both ends of every link, and the fiber connecting them. Some clusters use RDMA over Converged Ethernet (RoCE) instead of InfiniBand, trading some performance for lower cost and more familiar networking hardware. [3]

Infrastructure covers racks (~$2-3K each), PDUs (the power distribution units that feed electricity to each rack, ~$1K each, 2-3 per rack for redundancy), firewalls (~$20-30K), and structured cabling. A 576-GPU cluster uses roughly 40 racks and around 80 PDUs. [2] None of these items are individually expensive, but they have long lead times and are easy to forget during budgeting.
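The infrastructure line items above multiply out quickly. A sketch for a ~576-GPU cluster, assuming midpoints of the stated price ranges (rack ~$2.5K, PDU ~$1K, firewall ~$25K) and the low end of 2-3 PDUs per rack:

```python
# Infrastructure line items for a ~576-GPU cluster, using rough unit
# prices from the text. Midpoints ($2.5K rack, $25K firewall) and the
# low end of 2-3 PDUs per rack are assumptions for the sketch.
racks = 40
pdus_per_rack = 2
rack_price, pdu_price, firewall_price = 2_500, 1_000, 25_000

pdus = racks * pdus_per_rack
total = racks * rack_price + pdus * pdu_price + firewall_price
print(pdus)                   # 80 PDUs
print(f"${total/1000:.0f}K")  # ~$205K before structured cabling
```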

Real BOMs at three scales

A single 8-GPU server is a $250-400K purchase. Plug it into facility power and Ethernet. It works. No network management layer. No InfiniBand.

Adding a second server doubles GPU count but introduces networking. Two servers need switches, transceivers, and fiber: $35-50K in hardware that wasn't part of the single-server BOM. The bigger the cluster, the more layers of switches and cabling you need to connect everything, so networking takes a larger share of CapEx at scale.
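The extra switch layers can be sized with a back-of-envelope fat-tree calculation. The sketch below assumes a non-blocking two-tier (leaf-spine) design with 144-port switches, which reproduces the ~12 switches of the article's 576-GPU build; real topologies vary by vendor, oversubscription ratio, and rail design:

```python
import math

# Why networking grows with scale: in a non-blocking two-tier fat tree,
# each leaf switch spends half its ports on uplinks to the spine layer.
# Illustrative sketch only; assumes 144-port switches.
def two_tier_switches(endpoints: int, ports: int = 144):
    down = ports // 2                          # leaf ports facing servers
    leaves = math.ceil(endpoints / down)       # one port per GPU adapter
    spines = math.ceil(leaves * down / ports)  # spines absorb the uplinks
    return leaves, spines

leaves, spines = two_tier_switches(576)
print(leaves, spines, leaves + spines)  # 8 leaves + 4 spines = 12 switches
```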

AI Cluster BOM Breakdown

CapEx by component category:

| Cluster | GPU servers | Networking | Storage | Everything else | Total |
| --- | --- | --- | --- | --- | --- |
| 16 GPUs (B200) | $730K (69%) | $195K (18%) | $75K (7%) | $55K (5%) | $1.1M |
| 576 GPUs (B200) | $24.5M (70%) | $6.0M (17%) | $1.8M (5%) | $2.9M (8%) | $35.2M |
| 24,576 GPUs (B200, est.) | $836.0M (72%) | $221.0M (19%) | $98.0M (8%) | — | $1.2B |

| Cluster size | GPUs | Approximate CapEx | CapEx per GPU | Networking % |
| --- | --- | --- | --- | --- |
| Single server (B200) | 8 | $250-400K | ~$30-50K | ~0% |
| Small cluster (B200) | 16 | ~$1M | ~$66K | ~18% |
| Mid-size cluster (B200) | 576 | ~$35M | ~$61K | ~17% |
| Hyperscale (B200, est.) | 24,576 | ~$1.15B | ~$47K | ~19% |

Per-GPU costs drop at hyperscale. At 24,576 GPUs, buyers can negotiate directly with component manufacturers and use an ODM (original design manufacturer) like Quanta instead of an OEM like Dell or Supermicro, cutting per-server costs by roughly 15-20%. [1] All rows above use B200 pricing. For reference, Meta's actual 24,576-GPU H100 cluster cost $910M ($37K per GPU), but that used the prior-generation H100 with a cheaper GPU baseboard ($195K vs. ~$280-300K for B200 at OEM pricing). The detailed H100 breakdown appears below. [1]
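The per-GPU figures for the two hyperscale builds are straight division over the totals quoted above:

```python
# Per-GPU CapEx at hyperscale: check the two figures cited in the text.
meta_h100 = 910_366_000 / 24_576  # Meta's H100 cluster (actual total)
b200_est  = 1.15e9 / 24_576       # B200 hyperscale estimate

print(f"${meta_h100/1000:.0f}K")  # ~$37K per GPU (H100, ODM pricing)
print(f"${b200_est/1000:.0f}K")   # ~$47K per GPU (B200, estimated)
```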

16-GPU B200 cluster (~$1M)

A typical small cluster, based on our review of real BOMs: 2 GPU servers (16 B200 GPUs total), a few CPU management nodes, a couple of storage nodes, InfiniBand switches, and a firewall. [3]

| Category | Items | Cost | % |
| --- | --- | --- | --- |
| GPU servers | 2x 8-GPU servers | ~$730K | ~70% |
| CPU/management | Management and scheduling nodes | ~$55K | ~5% |
| Storage | Dedicated storage nodes | ~$75K | ~7% |
| InfiniBand | Switches + network management licenses | ~$110K | ~10% |
| Ethernet, firewall, cabling | Switches, firewall, transceivers, fiber | ~$85K | ~8% |
| Total | | ~$1.05M | |

Storage is the most discretionary category here. Some clusters need dedicated storage nodes for large datasets; others rely entirely on the NVMe drives already inside each GPU server.

576-GPU B300 cluster (~$36M)

This section details a specific B300 deployment. B300 servers run roughly $60-80K more per server than B200 due to the higher-end GPU board, but the proportional BOM breakdown is similar.

A mid-scale deployment, based on our review of multiple similar-sized BOMs from early 2026: 72 GPU servers (576 GPUs total, 8 per server) with InfiniBand networking, management servers, and data center infrastructure, delivered turnkey. [2]

  • 72 GPU servers with InfiniBand networking per server
  • ~12 XDR InfiniBand switches (Q-3400, 144 ports each), ~18 Ethernet switches
  • ~40 racks, ~80 PDUs, ~1,500 fiber runs
  • Rack integration and deployment included

At roughly $36M for 576 GPUs, the all-in cost is about $63,000 per GPU. That includes GPUs, networking, infrastructure, cabling, and physical deployment.

Meta's 24,576-GPU H100 cluster ($910M)

The hyperscale extreme, estimated by Pytorch to Atoms (May 2024). [1] Meta partnered directly with Quanta to design custom H100 server hardware, bypassing OEM markups entirely.

| Component | Qty | Unit price | Total | % |
| --- | --- | --- | --- | --- |
| GPU boards (8x H100 each) | 3,072 | $195,000 | $599,040,000 | 65.8% |
| InfiniBand switches (QM9700) | 1,920 | $35,000 | $67,200,000 | 7.4% |
| Optical transceivers | 73,728 | $1,000-$1,300 | $88,474,000 | 9.7% |
| InfiniBand network adapters | 24,576 | $1,200 | $29,491,000 | 3.2% |
| DDR5 RAM | 3,072 | $7,860 | $24,146,000 | 2.7% |
| Intel Xeon CPUs | 6,144 | $2,600 | $15,974,000 | 1.8% |
| Everything else | | | $86,041,000 | 9.4% |
| Total | | | $910,366,000 | |

DPUs, storage drives, Ethernet, fiber, chassis, cooling, power supplies, racks, power distribution, and contract manufacturer markup are grouped as "Everything else." Source: Pytorch to Atoms estimates (May 2024). [1]
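The line items in the table multiply out cleanly from quantity and unit price, which is a useful habit when auditing any vendor BOM:

```python
# Cross-check the Meta BOM line items above: quantity x unit price.
# Transceivers are omitted because the table quotes a price range,
# not a single unit price.
lines = [
    ("GPU boards",   3_072, 195_000),
    ("IB switches",  1_920,  35_000),
    ("IB adapters", 24_576,   1_200),
    ("DDR5 RAM",     3_072,   7_860),
    ("Xeon CPUs",    6_144,   2_600),
]
for name, qty, price in lines:
    print(f"{name}: ${qty * price:,}")

# GPU boards alone: 3,072 x $195,000 = $599,040,000 of the $910M total.
share = 3_072 * 195_000 / 910_366_000
print(f"{share:.1%}")  # ~65.8%, matching the table
```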

NVIDIA components account for most of this BOM. GPU boards, InfiniBand switches, and network adapters alone total $696M (76.4% of CapEx). The $88M in optical transceivers flows to a mix of NVIDIA and third-party vendors (InnoLight, Coherent, and others). Either way, NVIDIA is the dominant cost across both compute and networking.

References

  1. Pytorch to Atoms, "Meta's 24k H100 Cluster Capex/TCO and BoM Analysis" (May 2024). https://pytorchtoatoms.substack.com/p/metas-24k-h100-cluster-capextco-and
  2. Based on review of multiple 576-GPU B300 cluster deployment quotes (early 2026)
  3. Based on our review of real BOMs and industry pricing (2025-2026)
  4. NVIDIA, "QM9700 InfiniBand Switch — Specifications" (accessed March 2026)
  5. NVIDIA, "HGX Platform" (accessed March 2026). https://www.nvidia.com/en-us/data-center/hgx/

Frequently Asked Questions

What is a BOM for AI cluster CapEx?

An AI cluster's CapEx (capital expenditure) is defined by its Bill of Materials (BOM), the complete list of hardware needed to build it. A BOM is the full parts list for a system: every component, its quantity, and its unit price.

How much of AI cluster CapEx is GPUs versus networking?

GPUs account for 60-70% of total cost. Networking runs 10-25%. In Blackwell deployments we've reviewed, GPUs take roughly two-thirds of total spend.

How much does a 16-GPU, 576-GPU, or 24,576-GPU cluster cost?

Based on current Blackwell-generation BOMs reviewed in 2025-2026, a 16-GPU cluster costs roughly $1M, a 576-GPU deployment runs $36M, and a 24,576-GPU hyperscale cluster costs roughly $1.15B. For reference, Meta's actual 24,576-GPU H100 cluster cost $910M ($37K per GPU).

Why invest in networking for GPU training clusters?

Networking is worth the spend because it's the cheapest way to make GPUs faster. A 5% training speed-up from better networking gear costs far less than buying enough additional GPUs to get the same improvement. Underinvesting in networking means your GPUs spend more time waiting for data than computing.
