NVIDIA AI GPU Differences from Ampere to Blackwell

Illia Kasian

NVIDIA has shipped four generations of AI accelerators since 2017 (Volta, Ampere, Hopper, and Blackwell), each with architecture changes that matter more than raw TFLOPS. This guide explains what the specs actually measure, how the generations compare, and which one fits your workload and budget in 2026.

Inference tokens per second (MLPerf):

  • Ampere (2020): 1,630*
  • Hopper (2022): 4,374 (2.7x)
  • Blackwell (2025): 12,934 (3.0x)

*A100 MLPerf estimate from v3.1 GPT-J ratio [1]

The numbers above come from MLPerf, an industry-standard benchmark suite run by MLCommons, where vendors submit optimized results on identical models and datasets. Real-world performance varies for several reasons:

  • Splitting a model across multiple GPUs makes interconnect bandwidth a bottleneck
  • Batch size, sequence length, and quantization strategy all shift which hardware spec matters most
  • Software improvements (better kernels, compiler optimizations, serving frameworks like vLLM) routinely deliver double-digit speedups on the same hardware years after launch

Cores, FLOPS, and precision

A GPU is a massively parallel processor. Where a CPU has 8-64 cores optimized for sequential tasks, a data center GPU has thousands of smaller cores running simultaneously. NVIDIA's data center GPUs have two core types. CUDA cores are general-purpose: they handle activation functions and anything that isn't matrix math. Tensor Cores are specialized circuits that multiply entire matrix blocks in a single clock cycle. V100 introduced Tensor Cores in 2017, delivering 5-12x more training throughput than its predecessor, the P100. [4]

Matrix multiplication accounts for 80-90% of training and inference compute in modern neural networks, [5] so Tensor Core throughput is the number everyone compares. That throughput is measured in FLOPS: floating-point operations per second. One FLOP is a single floating-point multiplication or addition. TFLOPS is trillions of FLOPS.
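As a back-of-envelope illustration (a sketch, not a benchmark): multiplying an m×k matrix by a k×n matrix costs 2·m·n·k FLOPs, so a GPU's peak TFLOPS sets a floor on how long any matmul kernel can take.

```python
# Minimum time for one matrix multiply at a given dense Tensor Core
# rate. A matmul of (m x k) by (k x n) costs 2*m*n*k FLOPs
# (one multiply plus one add per element pair).

def matmul_seconds(m: int, n: int, k: int, tflops: float) -> float:
    flops = 2 * m * n * k
    return flops / (tflops * 1e12)

# One 8192x8192 matmul at H100's 989 dense BF16 TFLOPS:
t = matmul_seconds(8192, 8192, 8192, 989)
print(f"{t * 1e3:.2f} ms at peak")  # ~1.11 ms; real kernels run below peak
```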

Precision is how many bits the GPU uses to represent each number. Think of it like rounding: FP32 stores a number like 3.14159265, FP16 rounds to something like 3.14, and FP8 rounds further to 3.1. Fewer bits means less accuracy, but the GPU moves less data per operation and fits more operations per cycle. Halving the bits roughly doubles throughput. The tradeoff works because neural networks are tolerant of rounding: training in 16-bit precision produces the same model quality as 32-bit for most workloads.
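To see the rounding concretely, here is a minimal NumPy demo. NumPy has no native FP8 or FP4 types (those live in add-on libraries), so this stops at FP16:

```python
import numpy as np

# The same value stored at decreasing precisions.
x = 3.14159265
print(np.float64(x))  # 3.14159265
print(np.float32(x))  # 3.1415927  (~7 significant digits)
print(np.float16(x))  # 3.14       (~3 significant digits)
```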

  • FP32 (32 bits): framework default, rarely used for AI compute
  • FP16 (16 bits): V100 (2017); since replaced by BF16
  • BF16 (16 bits): A100 (2020); training standard for most LLMs
  • FP8 (8 bits): H100 (2022); inference standard, training proven at scale
  • FP4 (4 bits): B200 (2025); emerging for inference

Each number is stored in three parts: sign (positive or negative), exponent (how large or small the number can be), and mantissa (how many decimal places it keeps). FP16 shrank the exponent from 8 bits to 5, which limited the range of numbers it could represent and caused training instability. BF16 fixed this by keeping FP32's full 8-bit exponent while cutting the mantissa instead, giving the same numeric range at half the total bits. That tradeoff (less decimal precision but full range) turned out to be what training needs, and BF16 replaced FP16 as the default. New formats historically take 3-4 years from hardware support to widespread adoption. [6] DeepSeek-V3 validated FP8 training at frontier scale in late 2024. [7]
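The range difference falls straight out of the bit layouts. A short sketch that computes the largest finite value each format can hold, using the standard IEEE-style convention that the top exponent code is reserved for infinity/NaN:

```python
# Max finite value implied by an exponent/mantissa split: the largest
# usable exponent is 2^(exp_bits-1) - 1, paired with an all-ones mantissa.

def max_finite(exp_bits: int, mantissa_bits: int) -> float:
    top_exp = 2 ** (exp_bits - 1) - 1
    return (2 - 2 ** -mantissa_bits) * 2.0 ** top_exp

print(f"FP16 (5 exp, 10 mant): {max_finite(5, 10):.5g}")  # 65504
print(f"BF16 (8 exp,  7 mant): {max_finite(8, 7):.5g}")   # ~3.39e38
print(f"FP32 (8 exp, 23 mant): {max_finite(8, 23):.5g}")  # ~3.4028e38
```

FP16 overflows at 65,504, which is why large gradient values destabilized training; BF16 and FP32 share essentially the same ~3.4e38 ceiling.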

The Transformer Engine, introduced on H100, is hardware that automatically switches between FP8 and FP16 during transformer layer computations. It monitors numerical ranges per layer and picks the lowest precision that preserves accuracy, giving near-FP8 throughput with FP16 training quality. [8] Blackwell's second-generation Transformer Engine extends this to FP4 for inference.
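For reference, a minimal sketch of what FP8 execution looks like from PyTorch using NVIDIA's transformer-engine Python package. The layer sizes are illustrative, the recipe arguments follow the library's documented defaults, and an H100-class GPU is assumed:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

recipe = DelayedScaling(fp8_format=Format.HYBRID)  # E4M3 fwd, E5M2 bwd
layer = te.Linear(4096, 4096).cuda()
x = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16)

# Inside this context, matmuls run in FP8 with per-tensor scaling;
# outside it, the same layer runs in ordinary 16-bit precision.
with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    y = layer(x)
```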

NVIDIA publishes two TFLOPS numbers for each GPU: dense and with-sparsity. Neural networks often end up with many zero values in their weight matrices. Structured sparsity, introduced on A100, takes advantage of this: if you force exactly half the values in each small block to be zero, the hardware can skip the zero multiplications entirely and finish twice as fast. The "with-sparsity" TFLOPS number assumes this optimization is active. Not all models qualify, so this article uses the dense (no zeros skipped) numbers throughout.
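A sketch of the 2:4 pruning rule in NumPy. This is illustrative, not NVIDIA's tooling; real workflows prune with library support and then fine-tune to recover accuracy:

```python
import numpy as np

# 2:4 structured sparsity: in every group of four weights, the two
# smallest-magnitude values become zero, so Sparse Tensor Cores can
# skip them and finish the matmul in half the time.
def prune_2_to_4(w: np.ndarray) -> np.ndarray:
    blocks = w.reshape(-1, 4).copy()
    drop = np.argsort(np.abs(blocks), axis=1)[:, :2]  # 2 smallest per block
    np.put_along_axis(blocks, drop, 0.0, axis=1)
    return blocks.reshape(w.shape)

w = np.random.randn(2, 8).astype(np.float32)
print(prune_2_to_4(w))  # exactly two zeros in every group of four
```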

FP16/BF16 Tensor TFLOPS (dense; V100 is FP16, A100 onward is BF16):

  • V100: 125
  • A100: 312 (2.5x)
  • H100: 989 (3.2x)
  • H200: 989
  • B200: 2,250 (2.3x)
  • B300: 2,500 (1.1x)
  • R200: 4,000 (1.6x)

What those numbers mean in practice: OpenAI trained GPT-3 (175B parameters) on 10,000 V100s at 125 TFLOPS each. [9] At H100's 989 TFLOPS, roughly 8x the raw throughput, the same compute budget fits on around 1,200 GPUs. Meta used that headroom differently, training the much larger Llama 3 on 24,000 H100s. [10] Each generation lets you either shrink your cluster or train a bigger model on the same one.
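The arithmetic behind that cluster-shrink claim, using this article's dense TFLOPS figures; utilization is ignored on both sides, so only the ratio matters:

```python
v100_tflops, h100_tflops = 125, 989
gpt3_cluster = 10_000  # V100s used for GPT-3

equivalent_h100s = gpt3_cluster * v100_tflops / h100_tflops
print(f"{equivalent_h100s:.0f} H100s")  # ~1264, the 'around 1,200' above
```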

Memory

GPU memory holds the model (its learned parameters and the data flowing through it during computation). NVIDIA's data center GPUs use HBM (High Bandwidth Memory): chips stacked vertically on the GPU package for much higher throughput than standard memory. Each HBM generation, HBM2, HBM2e, HBM3, HBM3e, increases both capacity and bandwidth.

Capacity determines the largest model that fits on one GPU without splitting across multiple GPUs. Bandwidth is how fast data moves from memory to the Tensor Cores. For inference, bandwidth often matters more than raw TFLOPS because the GPU reads the full set of model weights from memory for every token it generates. Each token also requires intermediate computations from all previous tokens. Rather than redo that math every time, the GPU stores those results in a KV cache. The longer the conversation, the larger the cache, and it shares the same memory pool as the model weights. So during inference, both capacity and bandwidth are under pressure at the same time.
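A rough upper bound on decode speed follows directly: every generated token reads all weights once, so tokens per second cannot exceed bandwidth divided by weight bytes. The 70B-parameter model with 1-byte (FP8) weights and single-stream generation below are illustrative assumptions:

```python
# Bandwidth roofline for single-stream decoding. Ignores the KV
# cache, batching, and overlap, so real numbers land below these.
params = 70e9
bytes_per_param = 1  # FP8 weights
weights_gb = params * bytes_per_param / 1e9

for gpu, bw_gbs in [("H100", 3350), ("H200", 4800), ("B200", 8000)]:
    print(f"{gpu}: <= {bw_gbs / weights_gb:.0f} tokens/s per stream")
# H100: <= 48, H200: <= 69, B200: <= 114 -- bandwidth-bound, not compute-bound
```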

Memory capacity (GB):

  • V100 (HBM2): 32
  • A100 (HBM2e): 80 (2.5x)
  • H100 (HBM3): 80
  • H200 (HBM3e): 141 (1.8x)
  • B200 (HBM3e): 180 (1.3x)
  • B300 (HBM3e): 288 (1.6x)
  • R200 (HBM4): 288

Memory bandwidth (GB/s):

  • V100 (HBM2): 900
  • A100 (HBM2e): 2,000 (2.2x)
  • H100 (HBM3): 3,350 (1.7x)
  • H200 (HBM3e): 4,800 (1.4x)
  • B200 (HBM3e): 8,000 (1.7x)
  • B300 (HBM3e): 8,000
  • R200 (HBM4): 22,000 (2.8x)

V100's 32 GB of memory limited single-GPU models to roughly 1-2 billion parameters. A100 raised capacity to 80 GB. By late 2023, the bottleneck for large-model inference had shifted from compute to memory bandwidth: H200 exists because of that shift, delivering 43% more bandwidth and 76% more capacity than H100 on the same compute die. B300 pushed to 288 GB using 12-layer HBM3e stacks. A model too large for one GPU must be split across several, which adds communication overhead; more memory per GPU means more of the model fits on each one, reducing that overhead. Rubin's HBM4 is the biggest jump on the bandwidth chart: 22 TB/s, nearly 3x B300. Previous HBM generations increased bandwidth by adding more stacks to the package; HBM4 instead widens each stack's connection to the GPU, moving more data per cycle.

Die size and multi-die packaging

A die is the physical silicon chip inside the GPU package. The nanometer number (12nm, 7nm, 4nm) is the manufacturing process: smaller means more transistors fit in the same area, giving either more compute or lower power per transistor.

Bigger dies pack more transistors, but at a given defect rate per wafer a larger die is more likely to contain a defect, which pushes manufacturing yield down and cost up. Rather than building one enormous chip, NVIDIA put two dies on a single Blackwell package connected by a 10 TB/s chip-to-chip interconnect (NV-HBI) that makes them appear as a single GPU to software. The result: 208 billion transistors, up from Hopper's 80 billion on a single die.
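The economics can be sketched with the textbook Poisson yield model, yield = exp(-D·A). The defect density below is an illustrative assumption, not foundry data:

```python
import math

# Classic Poisson die-yield model: probability a die of area A
# (cm^2) contains no defects at defect density D (defects/cm^2).
defects_per_cm2 = 0.1

for name, area_cm2 in [("one reticle-limited die", 8.0),
                       ("each half-size die", 4.0)]:
    y = math.exp(-defects_per_cm2 * area_cm2)
    print(f"{name}: {y:.0%} yield")
# One big die: ~45%. Each half-size die: ~67%. A defect now scraps
# half as much silicon, so more good area survives per wafer.
```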

Power

TDP (thermal design power) is how many watts the GPU draws under load. Every generation has shipped at higher TDP than the last, and that number determines which data centers can physically host your hardware.

Per-GPU TDP, 8-GPU server draw, and cooling:

  • V100: 300W per GPU, ~5.0 kW per server (air)
  • A100: 400W, ~6.5 kW (air)
  • H100: 700W, ~10.0 kW (air/liquid)
  • H200: 700W, ~10.0 kW (air/liquid)
  • B200: 1,000W, ~13.0 kW (air/liquid)
  • B300: 1,400W, ~15.0 kW (liquid)

Most data centers built before 2020 support 5-10 kW per rack. A single 8-GPU Blackwell server draws around 13 kW, exceeding that on its own. New AI-optimized facilities provision 40-100+ kW per rack with direct liquid cooling to the chip.

Air cooling worked for B200 at 1,000W, including NVIDIA's own DGX B200. Liquid cooling enables a higher 1,200W mode on B200. B300 at 1,400W effectively requires liquid cooling.

Running one 8-GPU B300 server at full load for a year at $0.10/kWh costs roughly $13,000 in power alone, before cooling overhead. A 1,000-GPU cluster's annual power bill can exceed $1 million.
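The power-bill arithmetic, spelled out:

```python
# Annual electricity cost for one 8-GPU B300 server at full load.
server_kw = 15.0          # 8-GPU B300 server draw under load
hours = 24 * 365
price_per_kwh = 0.10

annual = server_kw * hours * price_per_kwh
print(f"${annual:,.0f} per server-year")       # ~$13,140
print(f"${annual * 125:,.0f} for 1,000 GPUs")  # 125 servers, ~$1.6M
```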

NVLink

Training large models means splitting work across GPUs. NVLink is a direct GPU-to-GPU connection that bypasses the CPU, providing much higher bandwidth than the PCIe bus.

NVLink bandwidth (GB/s):

  • V100: 300
  • A100: 600 (2.0x)
  • H100: 900 (1.5x)
  • H200: 900
  • B200: 1,800 (2.0x)
  • B300: 1,800
  • R200: 3,600 (2.0x)

V100's NVLink 2.0 connected up to 8 GPUs in a single server, but clusters scaled poorly beyond that. A100 added NVSwitch, a dedicated chip that lets all 8 GPUs communicate at full bandwidth without CPU involvement. Blackwell's NVLink 5 doubled bandwidth again to 1.8 TB/s per GPU. A full 8-GPU B200 node communicates at 14.4 TB/s aggregate NVLink bandwidth. At rack scale, NVIDIA's NVL72 configurations connect 72 GPUs in a single NVLink domain, meaning every GPU can communicate directly with every other at full bandwidth.
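To see why link bandwidth matters for training, here is an idealized estimate using the textbook ring all-reduce cost, where each GPU moves 2(N-1)/N of the gradient bytes per step. The gradient size is an illustrative assumption, and real collectives add latency:

```python
# Idealized per-step all-reduce time across one 8-GPU node.
grad_gb = 140  # e.g. a 70B-parameter model's BF16 gradients
n_gpus = 8

for gpu, link_gbs in [("A100", 600), ("H100", 900), ("B200", 1800)]:
    traffic = 2 * (n_gpus - 1) / n_gpus * grad_gb  # GB through each link
    print(f"{gpu}: {traffic / link_gbs * 1e3:.0f} ms per all-reduce")
# A100: ~408 ms, H100: ~272 ms, B200: ~136 ms (ideal, zero latency)
```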

Multi-tenancy (MIG)

MIG (Multi-Instance GPU), introduced on A100, splits a single GPU into up to seven hardware-isolated instances, each with dedicated compute, memory, and bandwidth. Cloud providers use MIG to serve multiple inference customers from one GPU, which sharply improved the unit economics of GPU cloud serving. MIG carries forward on H100 and later GPUs.

Form factors

NVIDIA ships data center GPUs in two socket types. SXM mounts the GPU directly on a baseboard called HGX, which holds all the GPUs in a server and carries their NVLink wiring. PCIe is the standard expansion slot found in most servers. SXM delivers higher power limits and full NVLink bandwidth. PCIe versions existed for V100 through H200 at lower TDP. Starting with Blackwell, NVIDIA ships SXM only.

NVIDIA also ships Blackwell as Grace Blackwell, where a custom Arm CPU and two GPUs sit on a single module with a direct 900 GB/s link replacing PCIe.

Every generation at a glance

All specs below are SXM variants with dense TFLOPS.

| Spec | V100 | A100 | H100 | H200 | B200 | B300 | R200 |
|---|---|---|---|---|---|---|---|
| Architecture | Volta | Ampere | Hopper | Hopper | Blackwell | Blackwell | Rubin |
| Year | 2017 | 2020 | 2022 | 2024 | 2025 | 2025 | 2026 |
| BF16 Tensor (TFLOPS) | 125* [4] | 312 [11] | 989 [8] | 989 [12] | 2,250 [13] | ~2,500 | 4,000 [14] |
| FP8 Tensor (TFLOPS) | - | - | 1,979 [8] | 1,979 | 4,500 [13] | 5,000 [15] | 17,500 [14] |
| FP4 Tensor (TFLOPS) | - | - | - | - | 9,000 [13] | 15,000 [15] | 50,000 [16] |
| Memory | 32 GB HBM2 | 80 GB HBM2e | 80 GB HBM3 | 141 GB HBM3e | 180 GB HBM3e | 288 GB HBM3e | 288 GB HBM4 |
| Mem BW | 900 GB/s | 2.0 TB/s | 3.35 TB/s | 4.8 TB/s | 8.0 TB/s | 8.0 TB/s | 22 TB/s |
| NVLink BW | 300 GB/s | 600 GB/s | 900 GB/s | 900 GB/s | 1.8 TB/s | 1.8 TB/s | 3.6 TB/s |
| TDP | 300W | 400W | 700W | 700W | 1,000W | 1,400W | TBD |

*V100 supported FP16, not BF16. The 125 TFLOPS figure is FP16 Tensor Core throughput, listed here for cross-generation comparison.

The most important column depends on your workload. For training, focus on Tensor Core TFLOPS at your target precision and NVLink bandwidth. For inference, memory capacity and memory bandwidth usually matter more than peak compute.

What comes after Blackwell

NVIDIA announced the Rubin platform at CES 2026, with systems available from partners in the second half of 2026. [17] Rubin is not just a GPU upgrade. It is a six-chip platform redesign: the Rubin GPU, Vera CPU, NVLink 6 switch, ConnectX-9 network adapter, BlueField-4 data processing unit, and Spectrum-6 Ethernet switch, all designed together. [16]

Compute and precision. Rubin moves to TSMC 3nm and delivers roughly 50,000 FP4 TFLOPS, about 3.3x B300. [18] Where Blackwell introduced FP4 as an emerging inference format, Rubin is designed around it.

Memory. Rubin is the first generation with HBM4, jumping to 22 TB/s bandwidth, nearly 3x B300's 8 TB/s. As covered above, Blackwell's compute often outpaces its memory feed during inference. HBM4 closes that gap.
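One way to sanity-check that claim with the figures quoted in this article:

```python
# Generational growth in FP4 compute vs memory bandwidth, B300 -> R200.
b300 = {"fp4_tflops": 15_000, "bw_tbs": 8.0}
r200 = {"fp4_tflops": 50_000, "bw_tbs": 22.0}

compute_x = r200["fp4_tflops"] / b300["fp4_tflops"]
bandwidth_x = r200["bw_tbs"] / b300["bw_tbs"]
print(f"compute: {compute_x:.1f}x, bandwidth: {bandwidth_x:.1f}x")
# 3.3x compute vs 2.8x bandwidth: the memory feed nearly keeps pace,
# a much smaller gap than Hopper-to-Blackwell opened up.
```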

Die. Rubin Ultra, planned for 2027, doubles from two compute dies to four per package, with 1 TB of HBM4e. [19]

NVLink. NVLink 6 doubles bandwidth again to 3.6 TB/s per GPU. Rubin ships as the Vera Rubin NVL72 (72 GPUs in a single NVLink 6 domain) and the HGX Rubin NVL8 (8-GPU server board for x86 platforms, same form factor as HGX Blackwell). [17] [18]

References

  1. MLCommons MLPerf Inference v3.1 Results — NVIDIA A100 and H100 GPT-J submissions
  2. MLCommons MLPerf Inference v5.0 Results (April 2025)
  3. NVIDIA Blackwell Ultra Sets New Inference Records in MLPerf v5.1 (September 2025)
  4. NVIDIA V100 Tensor Core GPU Datasheet — https://www.nvidia.com/en-us/data-center/v100/
  5. Ailon et al., "Changing Base Without Losing Pace" (2025) — MatMuls as 80-90% of DNN training/inference compute — https://arxiv.org/abs/2503.12211
  6. Epoch AI, "Training Precision" — precision adoption across 272 notable AI models (2008-2025) — https://epoch.ai/data-insights/training-precision
  7. DeepSeek-V3 Technical Report (2024) — first frontier model trained with native FP8 — https://arxiv.org/abs/2412.19437
  8. NVIDIA H100 Tensor Core GPU Datasheet — https://www.nvidia.com/en-us/data-center/h100/
  9. Brown et al., "Language Models are Few-Shot Learners" (2020) — GPT-3 trained on V100 GPUs — https://arxiv.org/abs/2005.14165
  10. Meta AI, "Introducing Meta Llama 3" (2024) — trained on 24K H100 GPU clusters — https://ai.meta.com/blog/meta-llama-3/
  11. NVIDIA A100 Tensor Core GPU Datasheet — https://www.nvidia.com/en-us/data-center/a100/
  12. NVIDIA H200 Tensor Core GPU Datasheet — https://www.nvidia.com/en-us/data-center/h200/
  13. NVIDIA DGX B200 Datasheet — https://resources.nvidia.com/en-us-dgx-systems/dgx-b200-datasheet
  14. NVIDIA Vera Rubin NVL72 Product Page — per-GPU and system specs — https://www.nvidia.com/en-us/data-center/vera-rubin-nvl72/
  15. NVIDIA Developer Blog — Inside NVIDIA Blackwell Ultra: The Chip Powering the AI Factory Era — https://developer.nvidia.com/blog/inside-nvidia-blackwell-ultra-the-chip-powering-the-ai-factory-era/
  16. NVIDIA Developer Blog — Inside the NVIDIA Rubin Platform: Six New Chips (2026) — https://developer.nvidia.com/blog/inside-the-nvidia-rubin-platform-six-new-chips-one-ai-supercomputer/
  17. NVIDIA Rubin Platform Announcement, CES 2026 — https://nvidianews.nvidia.com/news/rubin-platform-ai-supercomputer
  18. ServeTheHome, "NVIDIA Launches Next-Generation Rubin AI Compute Platform at CES 2026" — https://www.servethehome.com/nvidia-launches-next-generation-rubin-ai-compute-platform-at-ces-2026/
  19. The Next Platform, "Nvidia Draws GPU System Roadmap Out To 2028" (March 2025) — https://www.nextplatform.com/2025/03/19/nvidia-draws-gpu-system-roadmap-out-to-2028/

Frequently Asked Questions

What is the difference between the H100 and B200?

B200 (Blackwell) delivers roughly 2.3x the BF16 Tensor TFLOPS of the H100 (2,250 vs 989), 2.4x the memory bandwidth (8.0 vs 3.35 TB/s), and 2x the NVLink bandwidth (1.8 vs 0.9 TB/s). B200 draws 1,000W compared to H100's 700W and can require liquid cooling at full power.

What is the Transformer Engine on NVIDIA GPUs?

The Transformer Engine, introduced on H100, is hardware that automatically switches between FP8 and FP16 during transformer layer computations. It monitors numerical ranges per layer and picks the lowest precision that preserves accuracy, giving near-FP8 throughput with FP16 training quality. Blackwell's second-generation Transformer Engine extends this to FP4 for inference.

Why does memory bandwidth matter more than TFLOPS for inference?

For inference, the GPU reads the full set of model weights from memory for every token it generates. Each token also requires intermediate computations from all previous tokens, stored in a KV cache that shares the same memory pool as the model weights. Both capacity and bandwidth are under pressure at the same time. H200 exists because of this shift, delivering 43% more bandwidth (4.8 vs 3.35 TB/s) and 76% more capacity than H100 on the same compute die.

What comes after Blackwell?

NVIDIA announced the Rubin platform at CES 2026, with systems available from partners in the second half of 2026. Rubin moves to TSMC 3nm and delivers roughly 50,000 FP4 TFLOPS, about 3.3x B300. It is the first generation with HBM4, jumping to 22 TB/s bandwidth, nearly 3x B300's 8 TB/s.
