AI Cluster Cost Breakdown: CapEx (2026)
An AI cluster's CapEx (capital expenditure) is defined by its Bill of Materials (BOM), the complete list of hardware needed to build it. GPUs account for 60-70% of total cost. Networking runs 10-25%. Storage, power distribution, cabling, and management infrastructure fill the rest. Based on current Blackwell-generation BOMs we've reviewed in 2025-2026, a 16-GPU cluster costs roughly $1M, a 576-GPU deployment runs $36M, and a 24,576-GPU hyperscale cluster would cost roughly $1.15B.
What is a BOM
A BOM (rhymes with "Tom") is the full parts list for a system: every component, its quantity, and its unit price. It might be 10 line items or 200. Some integrators bundle everything into one number: "$36M for a 576-GPU cluster, delivered and racked." Others break it down to individual cable lengths and optical modules.
The detail level matters. If you can't read it, you can't tell whether the storage is oversized for your workload, whether the network design matches what you need, or whether per-unit pricing is competitive. A detailed BOM is also how you compare quotes from competing vendors.
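To make the comparison concrete, here is a minimal sketch (Python) of treating a BOM as structured data so a quote can be rolled up by category and checked per GPU. Every line item, quantity, and price below is an illustrative placeholder, not from any real quote.

```python
from collections import defaultdict

# One BOM line item: category, description, quantity, unit price (USD).
# All line items and prices below are illustrative placeholders.
bom = [
    ("GPU servers",    "8-GPU HGX server",    2,  365_000),
    ("Networking",     "InfiniBand switch",   2,   32_000),
    ("Networking",     "Optical transceiver", 64,   1_200),
    ("Storage",        "NVMe storage node",   2,   37_500),
    ("Infrastructure", "Rack + PDUs",         1,    5_000),
]

def rollup(bom, gpu_count):
    """Total cost per category, overall total, and cost per GPU."""
    by_category = defaultdict(float)
    for category, _, qty, unit_price in bom:
        by_category[category] += qty * unit_price
    total = sum(by_category.values())
    return by_category, total, total / gpu_count

categories, total, per_gpu = rollup(bom, gpu_count=16)
for name, cost in sorted(categories.items(), key=lambda kv: -kv[1]):
    print(f"{name:15s} ${cost:>12,.0f}  ({cost / total:5.1%})")
print(f"{'Total':15s} ${total:>12,.0f}   (${per_gpu:,.0f} per GPU)")
```

Running the same roll-up on two competing quotes makes it obvious when one vendor's storage or networking line is out of proportion.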
What's in an AI cluster's BOM
An AI cluster's BOM breaks down into seven categories. Proportions shift with scale, GPU generation, and workload, but the categories stay consistent. The pricing and examples below are based on B200 BOMs unless noted otherwise. We also reference Meta's prior-generation H100 cluster as a hyperscale case study.
| Category | Typical % of CapEx | What it covers |
|---|---|---|
| GPU servers | 60-70% | GPU boards, CPUs, RAM, fast storage drives, power supplies, chassis, cooling |
| GPU networking | 10-25% | InfiniBand or Ethernet switches, network adapters, optical transceivers, fiber cabling |
| CPU/management nodes | 2-5% | Servers that run the scheduler, monitoring, and network management |
| Storage servers | 0-15% | Dedicated storage nodes for datasets and checkpoints (workload-dependent) |
| Infrastructure | 2-5% | Racks, power distribution units (PDUs), firewalls, structured cabling |
| Cabling and optics | 3-8% | Fiber runs, copper cables, optical modules that connect cables to switches |
| Software and services | 1-3% | Network management licenses, integration, warranty |
How costs change with scale (B200)
| Scale | GPUs | GPU servers | Networking | Storage | Everything else | Total | Per GPU |
|---|---|---|---|---|---|---|---|
| Single server | 8 | $350K | — | — | — | $350K | $44K |
| Small cluster | 16 | $730K | $195K | $75K | $55K | $1.1M | $66K |
| Mid-size cluster | 576 | $24.5M | $6.0M | $1.8M | $2.9M | $35.2M | $61K |
| Hyperscale (est.) | 24,576 | $835.6M | $221.2M | — | $98.3M | $1.2B | $47K |
The GPU share holds across generations. In Blackwell deployments we've reviewed, GPUs take roughly two-thirds of total spend [2]; Meta's prior-generation H100 cluster showed the same pattern at 65.8% [1].
Networking is the second-biggest cost, and its share of total CapEx grows with scale. High-speed interconnects like InfiniBand and RoCE (RDMA over Converged Ethernet) connect GPUs across servers into a shared fabric so they can work together on the same training job. A 16-GPU cluster needs a few switches; NDR (400G) InfiniBand switches like the QM9700 run ~$30-35K each as of early 2026 [3]. A 576-GPU cluster needs a dozen or more, and a 24,576-GPU cluster needs 1,920. Each link also needs optical transceivers, the small modules that plug into switches and adapters and convert electrical signals for fiber cables, at ~$1-1.5K per end. At Meta's scale, transceivers alone cost $88M [1].
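A back-of-envelope sketch (Python) shows how quickly those unit prices add up. The switch and transceiver counts for the 24,576-GPU case come from the Meta BOM table later in this article [1]; the smaller cases use assumed quantities (the 576-GPU transceiver count assumes roughly two optics per fiber run from the deployment described below) and midpoint prices.

```python
# Back-of-envelope networking cost using the unit prices cited above.
# 24,576-GPU quantities follow the Meta H100 BOM [1]; smaller cases
# are illustrative assumptions.
SWITCH_PRICE = 32_500        # ~$30-35K per NDR switch (midpoint)
TRANSCEIVER_PRICE = 1_200    # ~$1-1.5K per optical module

cases = {
    # name: (switches, transceivers)
    "16 GPUs":     (2,     64),        # assumed
    "576 GPUs":    (12,    3_000),     # assumed: ~1,500 fiber runs x 2 ends
    "24,576 GPUs": (1_920, 73_728),    # from the Meta BOM table below
}

for name, (switches, transceivers) in cases.items():
    switch_cost = switches * SWITCH_PRICE
    optics_cost = transceivers * TRANSCEIVER_PRICE
    print(f"{name:12s} switches ${switch_cost/1e6:6.1f}M   "
          f"optics ${optics_cost/1e6:6.1f}M")
# At 24,576 GPUs the optics alone land near the ~$88M figure cited above.
```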
Networking is worth the spend because it's the cheapest way to make GPUs faster. A 5% training speed-up from better networking gear costs far less than buying enough additional GPUs to get the same improvement, and those extra GPUs also need rack space, cooling, and power. Underinvesting in networking means your GPUs spend more time waiting for data than computing.
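To see why the trade favors networking, here is a toy calculation (Python). It uses the ~$61K all-in per-GPU cost from the mid-size cluster above; the $500K networking-upgrade figure is a made-up assumption purely for illustration.

```python
# Toy comparison: 5% more throughput via extra GPUs vs. better networking.
gpus = 576
capex_per_gpu = 61_000          # all-in CapEx per GPU (mid-size cluster above)
speedup = 0.05                  # target improvement in training throughput

extra_gpus = int(round(gpus * speedup))            # ~29 more GPUs
cost_via_gpus = extra_gpus * capex_per_gpu         # ~$1.8M, before power/cooling
cost_via_network = 500_000                         # hypothetical fabric upgrade

print(f"5% via extra GPUs:  ~${cost_via_gpus/1e6:.1f}M "
      f"({extra_gpus} GPUs, plus rack space, power, cooling)")
print(f"5% via networking:  ~${cost_via_network/1e6:.1f}M (assumed)")
```

Even if the real upgrade costs several times the assumed figure, it still undercuts buying the equivalent compute.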
Storage is the most variable category. A training cluster with large datasets might put 15% of CapEx into dedicated storage nodes. Pricing varies widely with capacity, CPU configuration, and NVMe drive count, but a typical all-NVMe storage server (50-200 TB raw) currently runs in the ~$35-40K range [3]. An inference cluster might need nothing beyond the NVMe drives (fast solid-state storage) already built into each GPU server.
Inside a GPU server
A GPU server isn't just graphics cards. Each server contains 8 GPUs on an HGX baseboard, NVIDIA's standard board that packages 8 GPUs with high-speed links between them [5]. It also has CPUs, system RAM, fast storage, network adapters, power supplies, and cooling. The GPUs do the training math. Everything else exists to keep data flowing into them fast enough that they're not sitting idle.
Inside a typical 8-GPU server at $250-400K (B200/B300 generation, as of early 2026, based on industry pricing) [3]:
| Component | What it does | Typical cost |
|---|---|---|
| HGX GPU board (8 GPUs) | The compute engine. 8 GPUs with high-speed direct links between them so they can share data without going through the network. | $200K-$300K+ |
| CPUs (2x) | Manage data flow between storage and GPUs, run the operating system. | $3K-$15K each |
| RAM (1-3 TB DDR5) | System memory that holds data in queue before it reaches the GPUs. | $5K-$10K |
| NVMe drives (2-10 TB) | Fast solid-state drives for the operating system, training checkpoints, and working datasets. | $3K-$15K |
| Network adapters (8x) | One network adapter per GPU, connecting the server to the rest of the cluster at 400Gbps (NDR) to 800Gbps (XDR) depending on generation. B300 systems ship with XDR (800G). | $1K-$1.5K each |
| BlueField-3 DPUs (1-2x) | Data processing units that offload networking, storage, and security tasks from the CPUs. DGX systems include 2; OEM HGX builds vary. | ~$2-6K each |
| Power supplies, chassis, cooling, rails | Redundant power supplies, server enclosure, fans or liquid cooling, rack-mount hardware. | $3K-$8K combined |
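As a sanity check, summing the low and high ends of the table's ranges (a quick Python sketch; the figures are just the ranges above) lands near the $250-400K delivered price quoted earlier, with OEM integration and margin making up the difference.

```python
# Low/high cost ranges per component from the table above (USD).
components = {
    "HGX GPU board (8 GPUs)":  (200_000, 300_000),
    "CPUs (2x)":               (2 * 3_000, 2 * 15_000),
    "RAM (1-3 TB DDR5)":       (5_000, 10_000),
    "NVMe drives":             (3_000, 15_000),
    "Network adapters (8x)":   (8 * 1_000, 8 * 1_500),
    "BlueField-3 DPUs (1-2x)": (1 * 2_000, 2 * 6_000),
    "Power, chassis, cooling": (3_000, 8_000),
}

low = sum(lo for lo, _ in components.values())
high = sum(hi for _, hi in components.values())
print(f"Component sum: ${low:,} - ${high:,}")   # ~$227K - $387K
# OEM integration, margin, and support push the delivered price
# toward the $250-400K range quoted above.
```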
Typical BOM Breakdown
Proportional CapEx by category
Sub-component percentages are shares of their parent category, not of total CapEx.

| Category (% of total CapEx) | Component | % of category |
|---|---|---|
| GPU servers (65%) | HGX GPU board | 69% |
| | OEM integration | 18% |
| | CPUs (2x) | 4% |
| | Network adapters (8x) | 3% |
| | NVMe storage | 3% |
| | RAM | 2% |
| | Power & chassis | 2% |
| | DPU | 1% |
| Networking (20%) | InfiniBand switches | 38% |
| | Optical transceivers | 28% |
| | Fiber & cabling | 15% |
| | Ethernet switches | 10% |
| | Network management | 5% |
| | Network adapters | 4% |
| Everything else (10%) | CPU/mgmt nodes | 30% |
| | Racks & enclosures | 20% |
| | PDUs | 15% |
| | Firewall | 15% |
| | Cabling | 10% |
| | Software & licenses | 10% |
| Storage (5%) | NVMe drives | 65% |
| | Server hardware | 30% |
| | Software | 5% |
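Because the sub-shares are nested, turning them into dollars means multiplying twice: a component's share of total CapEx is its category share times its share of the category. A short sketch (Python) applies the typical percentages above to the $35.2M mid-size example; the pairing of "typical" shares with that specific budget is only for illustration.

```python
# Nested shares from the breakdown above: category share of total CapEx,
# then each component's share of its category.
total_capex = 35_200_000   # mid-size 576-GPU example from earlier

breakdown = {
    "GPU servers": (0.65, {"HGX GPU board": 0.69, "OEM integration": 0.18}),
    "Networking":  (0.20, {"InfiniBand switches": 0.38, "Optical transceivers": 0.28}),
}

for category, (cat_share, parts) in breakdown.items():
    cat_dollars = total_capex * cat_share
    print(f"{category}: ${cat_dollars/1e6:.1f}M")
    for part, part_share in parts.items():
        # A component's share of *total* CapEx is cat_share * part_share.
        print(f"  {part}: ${cat_dollars * part_share / 1e6:.1f}M "
              f"({cat_share * part_share:.0%} of total)")
```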
The rest of the BOM
Beyond the GPU servers, every cluster needs management nodes, networking equipment, and physical infrastructure. None of these categories are individually dominant, but they add up.
Management nodes (~$15K each) run the job scheduler that assigns work to GPUs, manage the InfiniBand network, and handle monitoring and logging [3].
InfiniBand is the high-speed network that connects GPU servers into a cluster. Without it, each server is an island. The cost comes from switches (NDR-gen QM9700s run ~$30-35K each as of early 2026), a network adapter in each server for every GPU, optical transceivers on both ends, and the fiber connecting them. Some clusters use RDMA over Converged Ethernet (RoCE) instead of InfiniBand, trading some performance for lower cost and more familiar networking hardware [3].
Infrastructure covers racks (~$2-3K each), PDUs, the power distribution units that feed electricity to each rack (~$1K each, 2-3 per rack for redundancy), firewalls (~$20-30K), and structured cabling. A 576-GPU cluster uses roughly 40 racks and around 80 PDUs [2]. None of these items are individually expensive, but they have long lead times and are easy to forget during budgeting.
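A quick way to budget these items is to count racks first and let everything else follow. A rough sketch (Python): the servers-per-rack figure is an assumption driven by power and cooling limits, and the unit prices are the ones quoted above.

```python
# Rough rack/PDU/infrastructure count for a cluster, using the unit
# prices quoted above. Servers-per-rack is an assumption (power-limited).
gpus = 576
gpus_per_server = 8
servers_per_rack = 2          # assumption: high-density racks
pdus_per_rack = 2             # 2-3 per rack for redundancy

servers = gpus // gpus_per_server                 # 72
racks = -(-servers // servers_per_rack)           # ceil -> 36 compute racks,
                                                  # plus switch/storage/mgmt racks
pdus = racks * pdus_per_rack

rack_cost = racks * 2_500      # ~$2-3K per rack
pdu_cost = pdus * 1_000        # ~$1K per PDU
print(f"{servers} servers, ~{racks} compute racks, ~{pdus} PDUs")
print(f"Racks ~${rack_cost/1e3:.0f}K, PDUs ~${pdu_cost/1e3:.0f}K, firewall ~$25K")
```

Counting in the network, storage, and management racks brings the totals close to the ~40 racks and ~80 PDUs cited above.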
Real BOMs at three scales
A single 8-GPU server is a $250-400K purchase. Plug it into facility power and Ethernet. It works. No network management layer. No InfiniBand.
Adding a second server doubles GPU count but introduces networking. Two servers need switches, transceivers, and fiber: $35-50K in hardware that wasn't part of the single-server BOM. The bigger the cluster, the more layers of switches and cabling you need to connect everything, so networking takes a larger share of CapEx at scale.
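The scaling effect is easiest to see with a little switch counting. The sketch below (Python) estimates leaf and spine counts for a non-blocking two-tier fat-tree, one common design; the 144-port radix mirrors the Q-3400-class XDR switches mentioned later. Real clusters (rail-optimized, oversubscribed, or three-tier topologies) differ, so treat it as an illustration of why the share grows, not a sizing tool.

```python
import math

def two_tier_switches(gpus, ports_per_switch):
    """Leaf/spine counts for a non-blocking two-tier fat-tree.

    Each leaf uses half its ports for GPU NICs and half as uplinks.
    Returns None when the cluster exceeds what two tiers can reach
    (more than ports^2 / 2 endpoints), which forces a third tier.
    """
    down_ports = ports_per_switch // 2
    if gpus > down_ports * ports_per_switch:
        return None                      # needs a third switch tier
    leaves = math.ceil(gpus / down_ports)
    spines = math.ceil(leaves * down_ports / ports_per_switch)
    return leaves + spines

for gpus in (576, 2_048, 10_368, 24_576):
    n = two_tier_switches(gpus, ports_per_switch=144)   # XDR-class radix
    note = f"{n / gpus:.3f} switches/GPU" if n else "third tier required"
    print(f"{gpus:>6} GPUs: {n or '-':>4} switches  ({note})")
```

Within two tiers the switch count per GPU stays roughly flat (and 576 GPUs lands on about 12 switches, matching the deployment described below). Past the two-tier ceiling, a third switch layer plus the extra optics between layers is what nudges networking toward the ~19% share in the hyperscale row below.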
AI Cluster BOM Breakdown
CapEx by component category
| Cluster | GPU servers | Networking | Storage | Everything else | Total |
|---|---|---|---|---|---|
| 16 GPUs (B200) | $730K (69%) | $195K (18%) | $75K (7%) | $55K (5%) | $1.1M |
| 576 GPUs (B200) | $24.5M (70%) | $6.0M (17%) | $1.8M (5%) | $2.9M (8%) | $35.2M |
| 24,576 GPUs (B200, est.) | $836.0M (72%) | $221.0M (19%) | — | $98.0M (8%) | $1.2B |
| Cluster size | GPUs | Approximate CapEx | CapEx per GPU | Networking % |
|---|---|---|---|---|
| Single server (B200) | 8 | $250-400K | ~$30-50K | ~0% |
| Small cluster (B200) | 16 | ~$1M | ~$66K | ~18% |
| Mid-size cluster (B200) | 576 | ~$35M | ~$61K | ~17% |
| Hyperscale (B200 est.) | 24,576 | ~$1.15B | ~$47K | ~19% |
Per-GPU costs drop at hyperscale. At 24,576 GPUs, buyers can negotiate directly with component manufacturers and use an ODM (original design manufacturer) like Quanta instead of an OEM like Dell or Supermicro, cutting per-server costs by roughly 15-20% [1]. All rows above use B200 pricing. For reference, Meta's actual 24,576-GPU H100 cluster cost $910M ($37K per GPU), but it used the prior-generation H100 with a cheaper GPU baseboard ($195K vs. ~$280-300K for B200 at OEM pricing). The detailed H100 breakdown appears below [1].
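A rough sketch of the ODM effect (Python): the OEM server price is an assumed midpoint of the $250-400K range above, and the 15-20% discount is the figure cited from [1], so the totals are illustrative.

```python
# Illustrative ODM-vs-OEM effect on server cost at hyperscale.
oem_server_price = 340_000           # assumed mid-range 8-GPU server (OEM)
servers = 24_576 // 8                # 3,072 servers

for discount in (0.15, 0.20):        # ODM savings range cited above
    odm_price = oem_server_price * (1 - discount)
    savings = (oem_server_price - odm_price) * servers
    print(f"{discount:.0%} discount: ${odm_price/1e3:.0f}K per server, "
          f"~${savings/1e6:.0f}M saved across the cluster")
```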
16-GPU B200 cluster (~$1M)
A typical small cluster, based on our review of real BOMs [3]: 2 GPU servers (16 B200 GPUs total), a few CPU management nodes, a couple of storage nodes, InfiniBand switches, and a firewall.
| Category | Items | Cost | % |
|---|---|---|---|
| GPU servers | 2x 8-GPU servers | ~$730K | ~70% |
| CPU/management | Management and scheduling nodes | ~$55K | ~5% |
| Storage | Dedicated storage nodes | ~$75K | ~7% |
| InfiniBand | Switches + network management licenses | ~$110K | ~10% |
| Ethernet, firewall, cabling | Switches, firewall, transceivers, fiber | ~$85K | ~8% |
| Total | | ~$1.05M | 100% |
Storage is the most discretionary category here. Some clusters need dedicated storage nodes for large datasets; others rely entirely on the NVMe drives already inside each GPU server.
576-GPU B300 cluster (~$36M)
This section details a specific B300 deployment. B300 servers run roughly $60-80K more per server than B200 due to the higher-end GPU board, but the proportional BOM breakdown is similar.
A mid-scale deployment based on our review of multiple similar-sized BOMs from early 2026 [2]: 72 GPU servers (576 GPUs total, 8 per server) with InfiniBand networking, management servers, and data center infrastructure, delivered turnkey.
- 72 GPU servers, each with its own InfiniBand connectivity
- ~12 XDR InfiniBand switches (Q-3400, 144 ports each), ~18 Ethernet switches
- ~40 racks, ~80 PDUs, ~1,500 fiber runs
- Rack integration and deployment included
At roughly $36M for 576 GPUs, the all-in cost is about $63,000 per GPU. That includes GPUs, networking, infrastructure, cabling, and physical deployment.
Meta's 24,576-GPU H100 cluster ($910M)
The hyperscale extreme, estimated by Pytorch to Atoms (May 2024) [1]. Meta partnered directly with Quanta to design custom H100 server hardware, bypassing OEM markups entirely.
| Component | Qty | Unit price | Total | % |
|---|---|---|---|---|
| GPU boards (8x H100 each) | 3,072 | $195,000 | $599,040,000 | 65.8% |
| InfiniBand switches (QM9700) | 1,920 | $35,000 | $67,200,000 | 7.4% |
| Optical transceivers | 73,728 | $1,000-$1,300 | $88,474,000 | 9.7% |
| InfiniBand network adapters | 24,576 | $1,200 | $29,491,000 | 3.2% |
| DDR5 RAM | 3,072 | $7,860 | $24,146,000 | 2.7% |
| Intel Xeon CPUs | 6,144 | $2,600 | $15,974,000 | 1.8% |
| Everything else | | | $86,041,000 | 9.4% |
| Total | | | $910,366,000 | 100% |
DPUs, storage drives, Ethernet, fiber, chassis, cooling, power supplies, racks, power distribution, and contract manufacturer markup are grouped as "Everything else." Source: Pytorch to Atoms estimates (May 2024) [1].
NVIDIA components account for most of this BOM. GPU boards, InfiniBand switches, and network adapters alone total $696M (76.4% of CapEx). The $88M in optical transceivers flows to a mix of NVIDIA and third-party vendors (InnoLight, Coherent, and others). Either way, NVIDIA is the dominant cost across both compute and networking.
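The table's arithmetic checks out; a small script (Python) reproduces the total and the NVIDIA share directly from the quantities and unit prices above, using the ~$1,200 blended transceiver price.

```python
# Reproduce the Meta H100 BOM totals from the table above (USD).
lines = {
    "GPU boards (8x H100)":  (3_072,  195_000),
    "InfiniBand switches":   (1_920,   35_000),
    "Optical transceivers":  (73_728,   1_200),   # blended $1,000-1,300
    "InfiniBand adapters":   (24_576,   1_200),
    "DDR5 RAM":              (3_072,    7_860),
    "Intel Xeon CPUs":       (6_144,    2_600),
}
everything_else = 86_041_000

subtotal = sum(qty * price for qty, price in lines.values())
total = subtotal + everything_else
nvidia = sum(lines[k][0] * lines[k][1] for k in
             ("GPU boards (8x H100)", "InfiniBand switches", "InfiniBand adapters"))

print(f"Total:        ${total/1e6:,.0f}M")                          # ~$910M
print(f"NVIDIA share: ${nvidia/1e6:,.0f}M ({nvidia/total:.1%})")    # ~$696M, ~76%
```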
References
1. Pytorch to Atoms, "Meta's 24k H100 Cluster Capex/TCO and BoM Analysis" (May 2024), https://pytorchtoatoms.substack.com/p/metas-24k-h100-cluster-capextco-and
2. Based on review of multiple 576-GPU B300 cluster deployment quotes (early 2026)
3. Based on our review of real BOMs and industry pricing (2025-2026)
4. NVIDIA, "QM9700 InfiniBand Switch — Specifications" (accessed March 2026)
5. NVIDIA, "HGX Platform" (accessed March 2026), https://www.nvidia.com/en-us/data-center/hgx/
Frequently Asked Questions
What is a BOM for AI cluster CapEx?
An AI cluster's CapEx (capital expenditure) is defined by its Bill of Materials (BOM), the complete list of hardware needed to build it. A BOM is the full parts list for a system: every component, its quantity, and its unit price.
How much of AI cluster CapEx is GPUs versus networking?
GPUs account for 60-70% of total cost. Networking runs 10-25%. In Blackwell deployments we've reviewed, GPUs take roughly two-thirds of total spend.
How much does a 16-GPU, 576-GPU, or 24,576-GPU cluster cost?
Based on current Blackwell-generation BOMs reviewed in 2025-2026, a 16-GPU cluster costs roughly $1M, a 576-GPU deployment runs $36M, and a 24,576-GPU hyperscale cluster costs roughly $1.15B. For reference, Meta's actual 24,576-GPU H100 cluster cost $910M ($37K per GPU).
Why invest in networking for GPU training clusters?
Networking is worth the spend because it's the cheapest way to make GPUs faster. A 5% training speed-up from better networking gear costs far less than buying enough additional GPUs to get the same improvement. Underinvesting in networking means your GPUs spend more time waiting for data than computing.