Every GPU Infrastructure Term You Need to Know
A GPU is a chip smaller than your hand. Getting useful work out of it requires a board, a server, a network, a building, and a power grid.
This guide follows that stack from the inside out, starting with the workload and zooming out one layer at a time until we reach the electricity bill.
The Workload
An AI model is a file of numbers. The models behind ChatGPT, Gemini, and Claude have hundreds of billions of parameters, numerical weights that encode everything the model has learned. Those weights are the product of training and the input to inference.
Training is how a model learns. You feed it massive datasets and the system adjusts billions of parameters over weeks or months until the model produces useful outputs. Training a frontier model from scratch can cost tens of millions of dollars in compute alone. [1] It is the most hardware-intensive workload in AI.
Inference is running the trained model to get answers. Every time you type a prompt into ChatGPT, that's inference. It requires less raw compute per request but needs fast response times, and at scale the cost adds up quickly because inference runs continuously.
Both reduce to the same underlying math:
> A typical neural network is ... made up of a sandwich of only two operations: matrix multiplication and thresholding at zero.
Matrix multiplication, billions of times over. That single fact determines what hardware you need.
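That sandwich is easy to see in code. A minimal NumPy sketch of one layer (shapes and values are illustrative, not from any particular model):

```python
import numpy as np

# One "layer" of the sandwich: matrix multiplication, then thresholding at zero.
# Illustrative shapes: a batch of 4 inputs, 8 features in, 16 features out.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))    # input activations
w = rng.standard_normal((8, 16))   # learned weights (the "parameters")

h = x @ w                  # matrix multiplication
out = np.maximum(h, 0.0)   # thresholding at zero (ReLU)

print(out.shape)  # (4, 16)
```

A real model stacks hundreds of these layers with billions of weights, but the two operations stay the same, which is why hardware that does matrix multiplication fast wins.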
The Chip
A CPU (Central Processing Unit) is the general-purpose processor in every computer. It's fast at sequential tasks but processes operations mostly one at a time. A GPU (Graphics Processing Unit), originally designed for rendering graphics, has thousands of small cores that run matrix math in parallel. That architecture maps almost perfectly onto the workload described above, which is why GPUs dominate AI.
GPUs are not the only option. Google's TPUs, AMD's Instinct series, and custom ASICs from companies like Cerebras are all AI accelerators optimized for the same math. When someone in this industry says "accelerator," they usually mean "not a CPU." But NVIDIA controls roughly 80% of the data center GPU market, [2] which is why most of this guide focuses on their hardware.
Form Factors and Memory
Not all GPUs are installed the same way. The two main form factors are SXM and PCIe.
SXM is NVIDIA's high-performance socket. SXM GPUs mount onto a dedicated board designed specifically for them, with high-speed wiring between GPU slots built in. They draw roughly 2x the power and deliver higher memory and interconnect bandwidth than GPUs that use the standard slot. [3] Every serious training system uses SXM. The tradeoff: you need the matching board, so you can't swap in GPUs from a different vendor.
PCIe is the standard slot used in every server and desktop for add-in cards (GPUs, network cards, storage controllers). PCIe GPUs are simpler to deploy and cheaper, but the connection between GPUs is slower, which limits how fast they can share data during training. They work well for inference, small-scale training, and mixed workloads.
The memory on a GPU is called HBM (High Bandwidth Memory). It's built from stacked DRAM dies sitting directly beside the GPU die on the same package, connected through a silicon interposer by thousands of tiny wires, which is what makes it so fast. HBM comes in generations: HBM2 (A100 40GB), HBM2e (A100 80GB), HBM3 (H100 80GB), HBM3e (H200 141GB). You'll also hear it called VRAM (Video RAM); in the data center context, they mean the same thing. More memory means larger models fit on one GPU without splitting them across multiple cards. That matters because splitting a model forces GPUs to constantly exchange intermediate results, and that communication is slow relative to the math.
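The fit question reduces to simple arithmetic: weights alone need parameter count times bytes per parameter. A rough sketch, assuming 2-byte (FP16/BF16) weights and ignoring activations, KV cache, and framework overhead, which add substantially on top (the model sizes below are illustrative):

```python
# Rough check of whether a model's weights alone fit in one GPU's HBM.
# Assumes 2 bytes per parameter (FP16/BF16); real deployments need
# extra headroom for activations, KV cache, and framework overhead.

def weights_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    return params_billions * 1e9 * bytes_per_param / 1e9

for params, hbm in [(70, 80), (70, 141), (405, 192)]:
    need = weights_gb(params)
    verdict = "fits" if need <= hbm else "needs splitting"
    print(f"{params}B params -> {need:.0f} GB weights vs {hbm} GB HBM: {verdict}")
```

This is why the jump from 80 GB (H100) to 141 GB (H200) matters: a 70B-parameter model's weights squeeze onto a single H200 but must be split across two H100s.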
Every GPU has a power rating: TDP (Thermal Design Power), the maximum watts the cooling system must be able to remove. An H100 SXM is rated at 700W. [3] A B200 reaches 1,000W. [4] These numbers set the requirements for every power supply, cable, and cooling system further up the stack.
NVIDIA's Generations
NVIDIA releases a new GPU architecture roughly every two years. Each generation resets what counts as "current" hardware and reprices everything that came before it.
| Generation | Key GPUs | Memory | Notable |
|---|---|---|---|
| Ampere (2020) | A100 | 40/80 GB HBM2e | Training workhorse for 3+ years. Now widely available on the secondary market at a fraction of original price. |
| Hopper (2023) | H100, H200 | 80 GB HBM3 / 141 GB HBM3e | Largest installed base of any training GPU. [13] Introduced the Transformer Engine for LLM attention layers. |
| Blackwell (2025) | B200, B300, GB200, GB300 | 192 GB HBM3e | ~2–2.5x Hopper per GPU. [5] GB200 pairs two GPUs + one Grace ARM CPU. Requires liquid cooling at full power. |
| Rubin (exp. H2 2026) | Vera Rubin NVL72 | 288 GB HBM4* | Rack-scale: 72 GPUs + 36 CPUs + NVLink in one rack. 3.3x Blackwell inference claimed. [6] |
* Manufacturer projections, not independent benchmarks.
The Server
A GPU can't run on its own. It needs power, cooling, a CPU to orchestrate it, memory to stage data, and storage to read from. All of that lives inside a server, a dedicated computer designed to run continuously in a data center rather than on someone's desk.
The Baseboard
SXM GPUs don't plug into a standard motherboard. They mount onto a dedicated baseboard, a physical board with high-speed wiring between GPU slots built in. You can't use SXM GPUs without one.
NVIDIA's baseboard is called HGX. It holds 4 or 8 SXM GPUs with direct high-speed connections between them. When you hear "HGX H100" or "HGX B200," that means 8 GPUs on one HGX board inside a server. Dell, HPE, Supermicro, and Lenovo all build their GPU servers around it.
On the open-standard side, there's OAM (OCP Accelerator Module), designed by the Open Compute Project. AMD's MI300X uses OAM form factor. The goal is vendor-neutral GPU module design, but NVIDIA's dominance means HGX still leads in deployments.
GPU-to-GPU Communication
When 8 GPUs sit on the same baseboard, they need to exchange data constantly during training. The speed of that communication is often the bottleneck.
NVLink is NVIDIA's proprietary high-speed link between GPUs within a server. NVLink 4.0 (Hopper) provides 900 GB/s of bidirectional bandwidth per GPU. [7] That speed is what allows 8 GPUs to work on the same model efficiently. Without it, GPUs fall back to the standard PCIe bus, the data pathway that connects CPUs, storage, and NICs in any server. The current generation, PCIe 5.0, delivers roughly 64 GB/s per x16 slot, more than 10x slower than NVLink.
Connecting all GPUs at full NVLink speed requires NVSwitch, a dedicated chip on the HGX baseboard. It creates an all-to-all topology: every GPU can talk to every other GPU at full bandwidth. Without NVSwitch, NVLink connections are point-to-point and only reach neighboring GPUs.
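To see why the bandwidth gap matters, here is a back-of-envelope estimate of one gradient synchronization (all-reduce) across 8 GPUs, using the bandwidth figures above and the standard ring all-reduce volume formula. Latency and compute overlap are ignored, and the 140 GB buffer is an illustrative FP16 gradient for a 70B-parameter model:

```python
# Back-of-envelope: time per all-reduce of one gradient-sized buffer,
# over NVLink vs PCIe. The 2*(n-1)/n factor is the data volume each GPU
# moves in a ring all-reduce; latency and overlap are ignored.

def allreduce_seconds(buffer_gb: float, n_gpus: int, link_gb_per_s: float) -> float:
    volume_gb = 2 * (n_gpus - 1) / n_gpus * buffer_gb  # GB moved per GPU
    return volume_gb / link_gb_per_s

grads_gb = 140  # illustrative: FP16 gradients of a 70B-parameter model
for name, bw in [("NVLink 4.0", 900), ("PCIe 5.0 x16", 64)]:
    print(f"{name}: {allreduce_seconds(grads_gb, 8, bw):.2f} s per all-reduce")
```

Training repeats this synchronization every step, so a ~14x slowdown on the interconnect compounds into a dominant share of wall-clock time.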
Everything Else in the Box
GPUs account for 60–70% of a server's cost, but every other component in the box can bottleneck them.
Most GPU servers use 1–2 Intel Xeon or AMD EPYC CPUs for data loading, preprocessing, job scheduling, and system management. The CPU doesn't do the AI math. It matters less than the GPU, but undersizing it can starve your data pipeline.
Memory and storage serve different roles. RAM (DDR4 or DDR5) is memory: fast, volatile, and used for data the CPU and GPUs need right now. It holds the OS, orchestration state, and data being staged for the GPUs. A typical training server has 512GB–2TB, cheap relative to the GPUs. NVMe SSDs (2–8 drives per server) are storage: persistent and slower, used for datasets, model checkpoints, and scratch space. When the server powers off, storage keeps the data; memory does not.
The PSU (Power Supply Unit) converts AC power from the facility to DC for the server. GPU servers need 2–6 PSUs for redundancy and capacity. An 8×H100 server draws roughly 10kW; a DGX B200 peaks at roughly 15kW. [4]
Every server also has a BMC (Baseboard Management Controller), a small embedded computer on the server's board that lets you remotely power on/off, monitor temperatures, update firmware, and troubleshoot without physically touching the machine. Essential when your hardware is in a colocation facility across the country.
These servers are measured in rack units (U), the standard unit of height in a server rack. 1U = 1.75 inches. A typical 8-GPU training server is 4U or 8U. A standard rack is 42U tall, so you fit roughly 5 eight-GPU servers in an 8U configuration, or up to 10 in a 4U configuration.
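In practice the binding constraint on a rack is often power delivery, not physical space. A small sketch combining the U math above with per-rack power budgets (the 40kW and 60kW budgets are illustrative assumptions):

```python
# How many 8-GPU servers fit in a rack is the minimum of two limits:
# physical space (rack units) and deliverable power. Figures from the
# text; rack power budgets are illustrative.

def servers_per_rack(rack_u: int = 42, server_u: int = 8,
                     server_kw: float = 10.0, rack_kw: float = 40.0) -> int:
    by_space = rack_u // server_u
    by_power = int(rack_kw // server_kw)
    return min(by_space, by_power)

print(servers_per_rack())              # 4 (power-limited: 5 fit physically)
print(servers_per_rack(rack_kw=60.0))  # 5 (space-limited)
```

This is exactly the power-density problem covered later: a facility wired for 5–10kW racks strands most of a rack's physical capacity.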
The Cluster
One server isn't enough. A single machine with 8 GPUs can fine-tune small models, but training anything at the frontier requires hundreds or thousands of GPUs working together. That means connecting servers.
Connect multiple servers with high-speed networking and you have a cluster. Each server in a cluster is called a node. A cluster can be as small as 2 nodes or scale to thousands with 100,000+ GPUs total, all working together as one system. A company's total deployed hardware across all locations is its fleet.
The Network Fabric
The gold standard for cluster networking is InfiniBand (IB), a high-bandwidth, low-latency fabric originally from HPC (supercomputing). NVIDIA acquired its maker, Mellanox, in 2020. [8] The current generation of InfiniBand is NDR, running at 400 Gb/s per port. [9]
InfiniBand is the default choice for multi-node GPU training because it natively supports RDMA (Remote Direct Memory Access), a protocol that lets one server read or write another server's memory directly, bypassing the CPU and OS entirely. Normal TCP/IP networking adds too much overhead for distributed training. RDMA eliminates it.
The alternative is Ethernet with RoCE (RDMA over Converged Ethernet), the standard networking protocol adapted for GPU workloads with RDMA support. It's cheaper, uses commodity switches, and is closing the performance gap with each generation. Many inference clusters and some training clusters run on RoCE. Broadcom and NVIDIA both sell RoCE-capable NICs.
Each server connects to the network through NICs (Network Interface Cards). In AI clusters, each server typically has multiple: high-speed ports for GPU traffic (InfiniBand or RoCE) and standard Ethernet for management and storage. Some clusters use DPUs (Data Processing Units), essentially smart NICs with their own processor that offload networking, security, and storage tasks from the main CPU. NVIDIA's BlueField is the most common. Useful at scale, overkill for small deployments.
Topology
Topology is how the switches and cables are arranged. The standard layout for AI clusters is called spine-leaf: one layer of switches connects directly to servers (leaf), and a second layer connects those switches to each other (spine). The result is that every server is the same distance from every other server, so performance is predictable.
Within that layout, you can wire for maximum speed between any two servers (full fat-tree, more switches, more cables, more expensive) or cut networking costs substantially by using fewer connections (rail-optimized, enough for most inference workloads but less suited for large training runs that need all-to-all communication).
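A rough way to size a non-blocking spine-leaf fabric: each leaf switch splits its ports half down to servers and half up to spines, so bandwidth into a leaf equals bandwidth out. The 64-port switch size below is an illustrative assumption, not a spec:

```python
# Sketch of non-blocking spine-leaf sizing. Each leaf uses half its ports
# for servers and half for uplinks (one uplink to each spine). Assumes
# 64-port switches for illustration.

def spine_leaf(n_servers: int, ports: int = 64) -> tuple[int, int]:
    down = ports // 2                  # server-facing ports per leaf
    leaves = -(-n_servers // down)     # ceiling division
    spines = down                      # one uplink per spine per leaf
    assert leaves <= ports, "exceeds two-tier capacity; add a third tier"
    return leaves, spines

print(spine_leaf(1024))  # (32, 32): 1024 servers need 64 switches total
```

A rail-optimized design cuts into exactly this uplink count, which is where the cost savings (and the reduced all-to-all bandwidth) come from.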
The Facility
Every watt a cluster draws becomes heat. Removing that heat, and supplying that power in the first place, is one of the hardest infrastructure problems in AI.
Cooling
Air cooling is the simplest approach: fans and heatsinks push air across components. It's well-understood, has lower capital cost, and works for GPUs up to roughly 700W TDP, which covers Hopper. But it's not sufficient for Blackwell at full power.
Direct liquid cooling (DLC) solves this. Cold plates mount directly on GPUs (and sometimes CPUs) with liquid circulating through them. DLC is required for Blackwell-class GPUs at full power draw. It's more efficient than air, but requires plumbing, a coolant distribution unit, and facility support. As of 2024, about 22% of data center operators use DLC, and most deploy it on fewer than 10% of their racks. Another 61% say they're considering it. [10]
A middle ground is the rear-door heat exchanger (RDHx), a liquid cooling unit mounted on the back door of a standard rack that intercepts hot exhaust air and cools it before it enters the room. It lets you increase rack density without full DLC. Useful as a transitional solution.
The liquid in a DLC system is managed by a CDU (Coolant Distribution Unit), the pump and heat exchange system that circulates coolant through the loop. It sits near the racks and connects to the facility's chilled water supply. One CDU typically serves one or a few racks.
How efficiently a facility handles all this is measured by PUE (Power Usage Effectiveness): total facility power divided by IT equipment power. A PUE of 1.0 would mean every watt goes to compute. No facility achieves that. The overhead goes to cooling systems, power conversion losses, lighting, and building infrastructure. Typical data centers run a PUE of 1.4–1.6, [10] meaning roughly 30–40% of facility power goes to non-compute overhead. Well-run modern facilities achieve 1.2–1.3, and purpose-built liquid-cooled facilities target 1.1–1.2. Google reports a fleet-wide PUE of 1.09. [11]
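The arithmetic is simple enough to sketch: the share of facility power lost to overhead is 1 - 1/PUE.

```python
# PUE = total facility power / IT power, so the fraction of facility
# power that is non-compute overhead is 1 - 1/PUE.

def overhead_fraction(pue: float) -> float:
    return 1 - 1 / pue

for pue in (1.5, 1.2, 1.09):
    print(f"PUE {pue}: {overhead_fraction(pue):.0%} of facility power is overhead")
```

At PUE 1.5, a third of every megawatt you buy never reaches a GPU; at Google's reported 1.09, it's under a tenth.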
The Data Center
A data center is a facility purpose-built to house servers: reinforced floors, redundant power feeds, cooling systems, physical security, fire suppression. Building one from scratch takes years and costs tens of millions before you install a single server.
Most operators skip that entirely and use colocation (colo), a business arrangement where you own the hardware and a facility provider supplies the physical space, power, cooling, and security. You ship your servers to their building.
Data centers are classified by the Uptime Institute into Tiers I through IV. Tier I has no redundancy and allows about 29 hours of downtime per year. Tier III has redundant power and cooling paths, allowing about 1.6 hours per year. Tier IV is fully fault-tolerant at under 30 minutes per year. [12] Most GPU clusters target Tier III.
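Those downtime figures follow directly from availability percentages. The percentages below are the values commonly quoted alongside the Uptime Institute tiers, stated here as an assumption rather than taken from this guide's sources:

```python
# Downtime per year implied by the availability figures commonly quoted
# for each Uptime Institute tier (assumed values, not from this guide).

HOURS_PER_YEAR = 8766  # 365.25 days

for tier, availability in [("I", 0.99671), ("II", 0.99741),
                           ("III", 0.99982), ("IV", 0.99995)]:
    downtime_h = (1 - availability) * HOURS_PER_YEAR
    print(f"Tier {tier}: ~{downtime_h:.1f} h/year of allowed downtime")
```

Tier I works out to roughly 29 hours, Tier III to about 1.6 hours, and Tier IV to under half an hour, matching the figures above.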
Inside the facility, servers mount into racks: standard 19-inch-wide enclosures measured in rack units. Standard height is 42U, though AI racks are increasingly taller (48U–52U) and deeper to accommodate GPU servers.
Managing all of this at scale requires DCIM (Data Center Infrastructure Management) software, which monitors power, cooling, capacity, and assets across the facility. It tracks which rack has capacity, how much power is being drawn, and when cooling is approaching limits.
Power Density
Not all data centers can handle AI workloads. Power density is how much power a single rack draws. A traditional enterprise rack draws 5–10kW. A rack filled with 8-GPU H100 servers draws 40–60kW. Blackwell racks can exceed 100kW.
The problem is not the electricity bill. The problem is that most existing facilities were wired, cooled, and physically built for 5–10kW per rack. Upgrading means new power distribution (often rewiring from 208V to 415V), new cooling infrastructure, and sometimes reinforced floors to handle heavier equipment. These upgrades take months to years.
The Supply Chain
The hardware is built, racked, and cooled. But who actually makes it, and how does it get to you?
Most enterprise GPU servers come from OEMs (Original Equipment Manufacturers): Dell, HPE, Supermicro, Lenovo. They buy NVIDIA GPUs and components, assemble complete branded servers, and provide warranties and support.
Behind the OEMs sit ODMs (Original Design Manufacturers) like Foxconn, Quanta, Wiwynn, and Inventec. These are contract manufacturers that build servers for hyperscalers and sometimes sell white-label systems. Lower cost, less support, higher volume. Not typical for small cluster buyers.
At the other end of the scale, the hyperscalers (AWS, Microsoft Azure, Google Cloud) operate their own data centers and design custom hardware. For smaller operators, hyperscalers represent the "rent" side of the buy-vs-rent decision: you pay per GPU-hour instead of owning anything.
Between hyperscalers and full ownership are the neoclouds (CoreWeave, Lambda, Together, Crusoe), companies that buy GPU hardware and sell compute as a service. More control than a hyperscaler, lower cost than owning outright. Their economics depend on utilization rates and hardware residual value, meaning how much the equipment is worth when the next generation ships.
The Economics
Every layer we've covered (chip, server, cluster, facility, supply chain) collapses into one question: what does it cost, and is it cheaper to own or rent?
CapEx (Capital Expenditure) is the upfront cost of buying hardware: GPUs, servers, networking, infrastructure. A single 8×H100 server runs $200–300K depending on vendor and configuration. It's a one-time hit that you depreciate over time on your balance sheet.
OpEx (Operational Expenditure) is everything that recurs: power, colocation fees, networking, staffing, maintenance, software licenses. For a GPU cluster, annual OpEx typically runs 20–30% of the original CapEx, depending on power rates and facility terms.
The number that actually matters is TCO (Total Cost of Ownership): the full cost of owning and operating hardware over its useful life, including CapEx, OpEx, financing costs, and the residual value you recover (or don't) when you sell the hardware. TCO is the figure you compare against cloud pricing.
Residual value is what your hardware is worth when you're done with it. A 3-year-old H100 server won't be worthless, but it won't be worth what you paid. How much it retains depends on the secondary market, the next generation of hardware, and the condition of your equipment. As we covered in the chip section, NVIDIA ships a new architecture roughly every two years, each one repricing everything before it. That depreciation risk is what residual value insurance addresses.
Most operators plan a refresh cycle of 3–5 years, selling old hardware on the secondary market and deploying the new generation. The decision of when to refresh is a constant tension between squeezing more life from existing hardware and falling behind on performance per dollar.
Cloud and neocloud pricing is measured in GPU-hours: one GPU running for one hour. When a neocloud charges $2.50/GPU-hour for an H100, that's the unit. To compare against ownership costs, you need to know your utilization rate, the percentage of time your GPUs are doing useful work. A training cluster running 24/7 might hit 90%+. A shared inference cluster might average 40–60%. Utilization is the single biggest factor in whether owning hardware beats renting from the cloud.
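Putting the pieces together, here is a toy own-versus-rent comparison built from the terms above. Every dollar figure is an illustrative assumption, not a quote:

```python
# Toy TCO comparison: effective $/GPU-hour of owning vs renting.
# All inputs are illustrative assumptions (financing costs omitted).

def own_cost_per_gpu_hour(capex: float, annual_opex_frac: float,
                          years: float, residual_frac: float,
                          n_gpus: int, utilization: float) -> float:
    # Total cost = CapEx + OpEx over the period - residual value recovered
    total = capex * (1 + annual_opex_frac * years - residual_frac)
    useful_hours = n_gpus * years * 8760 * utilization
    return total / useful_hours

# Assumed: 8xH100 server, $250K CapEx, 25%/yr OpEx, 3-year life, 20% residual
for util in (0.9, 0.5):
    c = own_cost_per_gpu_hour(250_000, 0.25, 3, 0.20, 8, util)
    print(f"utilization {util:.0%}: ${c:.2f}/GPU-hour owned vs $2.50 rented")
```

Under these assumptions, owning beats a $2.50/GPU-hour rental at 90% utilization (about $2.05) but loses at 50% (about $3.69), which is the utilization effect in one calculation.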
Glossary
- AI accelerator — Any non-CPU chip optimized for AI math (GPUs, TPUs, custom ASICs).
- Air cooling — Fans and heatsinks; works for GPUs up to ~700W TDP.
- Baseboard — Physical board with high-speed wiring between GPU slots. Required for SXM GPUs.
- BMC (Baseboard Management Controller) — Embedded computer for remote server management (power, monitoring, firmware).
- CapEx (Capital Expenditure) — Upfront cost of buying hardware.
- CDU (Coolant Distribution Unit) — Pump and heat exchange system that circulates liquid coolant through a DLC loop.
- Cluster — Multiple servers connected with high-speed networking, working as one system.
- Colocation (colo) — Renting physical space, power, and cooling in a third-party data center.
- CPU (Central Processing Unit) — General-purpose processor; handles orchestration, data loading, and system management in GPU servers.
- Data center — Facility purpose-built to house servers: redundant power, cooling, physical security.
- DCIM (Data Center Infrastructure Management) — Software that monitors power, cooling, capacity, and assets across a facility.
- DLC (Direct liquid cooling) — Cold plates mounted on GPUs with circulating liquid. Required for Blackwell-class GPUs at full power.
- DPU (Data Processing Unit) — Smart NIC with its own processor that offloads networking, security, and storage tasks.
- Fleet — A company's total deployed hardware across all locations.
- Full fat-tree — Network wiring for maximum speed between any two servers. More switches, more expensive.
- GPU (Graphics Processing Unit) — Processor with thousands of small cores that run matrix math in parallel. The dominant chip for AI.
- GPU-hours — Cloud pricing unit: one GPU running for one hour.
- HBM (High Bandwidth Memory) — Stacked memory beside the GPU die on the same package; extremely fast due to thousands of direct connections.
- HGX — NVIDIA's baseboard holding 4 or 8 SXM GPUs with direct high-speed interconnects.
- Hyperscalers — AWS, Azure, Google Cloud; operate their own data centers and design custom hardware.
- Inference — Running a trained model to get answers. Lower compute per request, but runs continuously.
- InfiniBand (IB) — High-bandwidth, low-latency network fabric; the gold standard for multi-node GPU training.
- Matrix multiplication — The core math operation in neural networks; determines what hardware you need.
- Neoclouds — Companies (CoreWeave, Lambda, etc.) that buy GPU hardware and sell compute as a service.
- NIC (Network Interface Card) — Hardware that connects a server to the network.
- Node — A single server within a cluster.
- NVLink — NVIDIA's high-speed link between GPUs within a server. 900 GB/s bidirectional on Hopper.
- NVMe SSD — Fast persistent storage for datasets, checkpoints, and scratch space.
- NVSwitch — Chip on the HGX baseboard that creates all-to-all NVLink topology between GPUs.
- OAM (OCP Accelerator Module) — Open-standard GPU module design by the Open Compute Project.
- ODM (Original Design Manufacturer) — Contract manufacturers (Foxconn, Quanta) that build servers for hyperscalers.
- OEM (Original Equipment Manufacturer) — Dell, HPE, Supermicro, Lenovo; assemble branded servers with warranties and support.
- OpEx (Operational Expenditure) — Recurring costs: power, colocation, staffing, maintenance, licenses.
- Parameters — Numerical weights in a neural network that encode what the model has learned.
- PCIe — Standard add-in card slot. Simpler and cheaper than SXM, but slower GPU-to-GPU bandwidth.
- Power density — How much power a single rack draws. AI racks (40–100kW+) far exceed traditional enterprise racks (5–10kW).
- PSU (Power Supply Unit) — Converts AC facility power to DC for the server. GPU servers need 2–6 for redundancy.
- PUE (Power Usage Effectiveness) — Total facility power / IT equipment power. Lower is better; 1.0 is theoretical perfect.
- Rack unit (U) — Standard height measurement in a server rack. 1U = 1.75 inches.
- Rail-optimized — Reduced network wiring; cheaper, suited for inference but less for large training runs.
- RAM (DDR4/DDR5) — Fast, volatile system memory for data the CPU and GPUs need right now. 512 GB–2 TB typical.
- RDMA (Remote Direct Memory Access) — Protocol that lets one server read/write another's memory directly, bypassing CPU and OS.
- RDHx (Rear-door heat exchanger) — Liquid cooling unit on the back of a rack that intercepts hot exhaust air.
- Refresh cycle — 3–5 year hardware replacement cycle; selling old equipment and deploying the next generation.
- Residual value — What hardware is worth when you're done with it.
- Residual value insurance — Insurance that protects against hardware depreciation risk.
- RoCE (RDMA over Converged Ethernet) — Standard Ethernet adapted for GPU workloads with RDMA. Cheaper than InfiniBand.
- Server — Dedicated computer designed to run continuously in a data center.
- Spine-leaf — Standard two-layer network topology for AI clusters. Every server equidistant from every other.
- SXM — NVIDIA's high-performance GPU socket; mounts onto a dedicated baseboard. Used in all serious training systems.
- TCO (Total Cost of Ownership) — Full cost of owning and operating hardware over its useful life (CapEx + OpEx + financing − residual).
- TDP (Thermal Design Power) — Maximum watts a chip draws; sets requirements for every cooling and power system above it.
- Tiers I–IV — Uptime Institute data center reliability classifications, from no redundancy (Tier I) to fully fault-tolerant (Tier IV).
- Topology — How switches and cables are arranged in a cluster network.
- Training — How a model learns: adjusting billions of parameters over weeks/months on massive datasets.
- Utilization rate — Percentage of time GPUs are doing useful work. The biggest factor in own-vs-rent economics.
- VRAM — Video RAM; in data center context, same as HBM.
References
1. Epoch AI, "How much does it cost to train frontier AI models?" (2024). https://epoch.ai/blog/how-much-does-it-cost-to-train-frontier-ai-models
2. Reuters, "Nvidia pursues $30 billion custom chip opportunity" (February 2024). https://www.reuters.com/technology/nvidia-chases-30-billion-custom-chip-market-with-new-unit-sources-2024-02-09/
3. NVIDIA H100 Tensor Core GPU Datasheet. https://www.nvidia.com/en-us/data-center/h100/
4. NVIDIA DGX B200 System Datasheet. https://resources.nvidia.com/en-us-dgx-systems/dgx-b200-datasheet
5. NVIDIA, "Blackwell Enables 3x Faster Training" (2025). https://developer.nvidia.com/blog/nvidia-blackwell-enables-3x-faster-training-and-nearly-2x-training-performance-per-dollar-than-previous-gen-architecture/
6. NVIDIA Vera Rubin NVL72 Product Page. https://www.nvidia.com/en-us/data-center/vera-rubin-nvl72/
7. NVIDIA NVLink and NVLink Switch. https://www.nvidia.com/en-us/data-center/nvlink/
8. NVIDIA, "NVIDIA Completes Acquisition of Mellanox" (April 2020). https://nvidianews.nvidia.com/news/nvidia-completes-acquisition-of-mellanox-creating-major-force-driving-next-gen-data-centers
9. NVIDIA Quantum-2 InfiniBand NDR Architecture Datasheet. https://docs.nvidia.com/networking/display/qm97x0pub/introduction
10. Uptime Institute Global Data Center Survey (2024). https://intelligence.uptimeinstitute.com/resource/uptime-institute-global-data-center-survey-2024
11. Google Data Centers, "Efficiency: How we measure PUE". https://datacenters.google/efficiency
12. Uptime Institute, "Tier Classification System". https://uptimeinstitute.com/tiers
13. Epoch AI, "Data on AI Chip Sales" (2026). https://epoch.ai/data/ai-chip-sales
Frequently Asked Questions
What is the difference between SXM and PCIe GPUs?
SXM is NVIDIA's high-performance socket. SXM GPUs mount onto a dedicated baseboard with high-speed wiring between GPU slots built in. They draw roughly 2x the power and deliver higher memory and interconnect bandwidth than PCIe GPUs. Every serious training system uses SXM. The tradeoff: you need the matching baseboard, so you can't swap in GPUs from a different vendor.
What is NVLink and why does it matter?
NVLink is NVIDIA's proprietary high-speed link between GPUs within a server. NVLink 4.0 (Hopper) provides 900 GB/s bidirectional bandwidth per GPU, over 10x faster than the standard PCIe bus at roughly 64 GB/s per x16 slot. Without NVLink, GPUs fall back to PCIe and cannot train large models efficiently.
What is a GPU cluster?
A cluster is multiple GPU servers connected with high-speed networking, working together as one system. A single machine with 8 GPUs can fine-tune small models, but training anything at the frontier requires hundreds or thousands of GPUs. Each server in a cluster is called a node. A cluster can scale from 2 nodes to thousands with 100,000+ GPUs total.
Should you own or rent GPU hardware?
Utilization rate is the single biggest factor. A training cluster running 24/7 might hit 90%+. A shared inference cluster might average 40–60%. TCO (total cost of ownership), which includes CapEx, OpEx, financing costs, and residual value, is the figure you compare against cloud pricing.
How does residual value insurance work?
Coverage creates a minimum value for what your GPUs are worth at a future date. If they sell below that floor, the policy pays you the difference.