NVIDIA AI GPU Differences from Ampere to Blackwell

Illia Kasian

NVIDIA has shipped four generations of AI accelerators since 2017 (Volta, Ampere, Hopper, and Blackwell), each with architecture changes that matter more than raw TFLOPS. This guide explains what the specs actually measure, how the generations compare, and which one fits your workload and budget in 2026.

Inference tokens per second (MLPerf):

  • Ampere (2020): 1,630*
  • Hopper (2022): 4,374 (2.7x)
  • Blackwell (2025): 12,934 (3.0x)

*A100 MLPerf estimate from v3.1 GPT-J ratio [1]

The numbers above come from MLPerf, an industry-standard benchmark suite run by MLCommons, where vendors submit optimized results on identical models and datasets. Real-world performance varies for several reasons:

  • Splitting a model across multiple GPUs makes interconnect bandwidth a bottleneck
  • Batch size, sequence length, and quantization strategy all shift which hardware spec matters most
  • Software improvements (better kernels, compiler optimizations, serving frameworks like vLLM) routinely deliver double-digit speedups on the same hardware years after launch

Cores, FLOPS, and precision

A GPU is a massively parallel processor. Where a CPU has 8-64 cores optimized for sequential tasks, a data center GPU has thousands of smaller cores running simultaneously. NVIDIA's data center GPUs have two core types. CUDA cores are general-purpose: they handle activation functions and anything that isn't matrix math. Tensor Cores are specialized circuits that multiply entire matrix blocks in a single clock cycle. V100 introduced Tensor Cores in 2017, delivering 5-12x more training throughput than its predecessor, the P100. [4]

Matrix multiplication accounts for 80-90% of training and inference compute in modern neural networks, [5] so Tensor Core throughput is the number everyone compares. That throughput is measured in FLOPS: floating-point operations per second. One FLOP is a single floating-point multiplication or addition. TFLOPS is trillions of FLOPS.
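As a back-of-envelope illustration (a sketch, not a benchmark): multiplying an m×k matrix by a k×n matrix costs 2·m·n·k FLOPs, so a GPU's peak TFLOPS sets a floor on how long any matmul kernel can take.

```python
# Minimum time for one matrix multiply at a given dense Tensor Core
# rate. A matmul of (m x k) by (k x n) costs 2*m*n*k FLOPs
# (one multiply plus one add per element pair).

def matmul_seconds(m: int, n: int, k: int, tflops: float) -> float:
    flops = 2 * m * n * k
    return flops / (tflops * 1e12)

# One 8192x8192 matmul at H100's 989 dense BF16 TFLOPS:
t = matmul_seconds(8192, 8192, 8192, 989)
print(f"{t * 1e3:.2f} ms at peak")  # ~1.11 ms; real kernels run below peak
```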

Precision is how many bits the GPU uses to represent each number. Think of it like rounding: FP32 stores a number like 3.14159265, FP16 rounds to something like 3.14, and FP8 rounds further to 3.1. Fewer bits means less accuracy, but the GPU moves less data per operation and fits more operations per cycle. Halving the bits roughly doubles throughput. The tradeoff works because neural networks are tolerant of rounding: training in 16-bit precision produces the same model quality as 32-bit for most workloads.
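To see the rounding concretely, here is a minimal NumPy demo. NumPy has no native FP8 or FP4 types (those live in add-on libraries), so this stops at FP16:

```python
import numpy as np

# The same value stored at decreasing precisions.
x = 3.14159265
print(np.float64(x))  # 3.14159265
print(np.float32(x))  # 3.1415927  (~7 significant digits)
print(np.float16(x))  # 3.14       (~3 significant digits)
```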

  • FP32 (32 bits): framework default, rarely used for AI compute
  • FP16 (16 bits): V100 (2017); since replaced by BF16
  • BF16 (16 bits): A100 (2020); training standard for most LLMs
  • FP8 (8 bits): H100 (2022); inference standard, training proven at scale
  • FP4 (4 bits): B200 (2025); emerging for inference

Each number is stored in three parts: sign (positive or negative), exponent (how large or small the number can be), and mantissa (how many decimal places it keeps). FP16 shrank the exponent from 8 bits to 5, which limited the range of numbers it could represent and caused training instability. BF16 fixed this by keeping FP32's full 8-bit exponent while cutting the mantissa instead, giving the same numeric range at half the total bits. That tradeoff (less decimal precision but full range) turned out to be what training needs, and BF16 replaced FP16 as the default. New formats historically take 3-4 years from hardware support to widespread adoption. [6] DeepSeek-V3 validated FP8 training at frontier scale in late 2024. [7]
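The range difference falls straight out of the bit layouts. A short sketch that computes the largest finite value each format can hold, using the standard IEEE-style convention that the top exponent code is reserved for infinity/NaN:

```python
# Max finite value implied by an exponent/mantissa split: the largest
# usable exponent is 2^(exp_bits-1) - 1, paired with an all-ones mantissa.

def max_finite(exp_bits: int, mantissa_bits: int) -> float:
    top_exp = 2 ** (exp_bits - 1) - 1
    return (2 - 2 ** -mantissa_bits) * 2.0 ** top_exp

print(f"FP16 (5 exp, 10 mant): {max_finite(5, 10):.5g}")  # 65504
print(f"BF16 (8 exp,  7 mant): {max_finite(8, 7):.5g}")   # ~3.39e38
print(f"FP32 (8 exp, 23 mant): {max_finite(8, 23):.5g}")  # ~3.4028e38
```

FP16 overflows at 65,504, which is why large gradient values destabilized training; BF16 and FP32 share essentially the same ~3.4e38 ceiling.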

The Transformer Engine, introduced on H100, is hardware that automatically switches between FP8 and FP16 during transformer layer computations. It monitors numerical ranges per layer and picks the lowest precision that preserves accuracy, giving near-FP8 throughput with FP16 training quality. [8] Blackwell's second-generation Transformer Engine extends this to FP4 for inference.
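For reference, a minimal sketch of what FP8 execution looks like from PyTorch using NVIDIA's transformer-engine Python package. The layer sizes are illustrative, the recipe arguments follow the library's documented defaults, and an H100-class GPU is assumed:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

recipe = DelayedScaling(fp8_format=Format.HYBRID)  # E4M3 fwd, E5M2 bwd
layer = te.Linear(4096, 4096).cuda()
x = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16)

# Inside this context, matmuls run in FP8 with per-tensor scaling;
# outside it, the same layer runs in ordinary 16-bit precision.
with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    y = layer(x)
```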

NVIDIA publishes two TFLOPS numbers for each GPU: dense and with-sparsity. Neural networks often end up with many zero values in their weight matrices. Structured sparsity, introduced on A100, takes advantage of this: if you force exactly half the values in each small block to be zero, the hardware can skip the zero multiplications entirely and finish twice as fast. The "with-sparsity" TFLOPS number assumes this optimization is active. Not all models qualify, so this article uses the dense (no zeros skipped) numbers throughout.
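A sketch of the 2:4 pruning rule in NumPy. This is illustrative, not NVIDIA's tooling; real workflows prune with library support and then fine-tune to recover accuracy:

```python
import numpy as np

# 2:4 structured sparsity: in every group of four weights, the two
# smallest-magnitude values become zero, so Sparse Tensor Cores can
# skip them and finish the matmul in half the time.
def prune_2_to_4(w: np.ndarray) -> np.ndarray:
    blocks = w.reshape(-1, 4).copy()
    drop = np.argsort(np.abs(blocks), axis=1)[:, :2]  # 2 smallest per block
    np.put_along_axis(blocks, drop, 0.0, axis=1)
    return blocks.reshape(w.shape)

w = np.random.randn(2, 8).astype(np.float32)
print(prune_2_to_4(w))  # exactly two zeros in every group of four
```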

FP16/BF16 Tensor TFLOPS (dense; V100 is FP16, A100 onward is BF16):

  • V100: 125
  • A100: 312 (2.5x)
  • H100: 989 (3.2x)
  • H200: 989
  • B200: 2,250 (2.3x)
  • B300: 2,500 (1.1x)
  • R200: 4,000 (1.6x)

What those numbers mean in practice: OpenAI trained GPT-3 (175B parameters) on 10,000 V100s at 125 TFLOPS each. [9] At H100's 989 TFLOPS, roughly 8x the raw throughput, the same compute budget fits on around 1,200 GPUs. Meta used that headroom differently, training the much larger Llama 3 on 24,000 H100s. [10] Each generation lets you either shrink your cluster or train a bigger model on the same one.
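The arithmetic behind that cluster-shrink claim, using this article's dense TFLOPS figures; utilization is ignored on both sides, so only the ratio matters:

```python
v100_tflops, h100_tflops = 125, 989
gpt3_cluster = 10_000  # V100s used for GPT-3

equivalent_h100s = gpt3_cluster * v100_tflops / h100_tflops
print(f"{equivalent_h100s:.0f} H100s")  # ~1264, the 'around 1,200' above
```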

Memory

GPU memory holds the model (its learned parameters and the data flowing through it during computation). NVIDIA's data center GPUs use HBM (High Bandwidth Memory): chips stacked vertically on the GPU package for much higher throughput than standard memory. Each HBM generation, HBM2, HBM2e, HBM3, HBM3e, increases both capacity and bandwidth.

Capacity determines the largest model that fits on one GPU without splitting across multiple GPUs. Bandwidth is how fast data moves from memory to the Tensor Cores. For inference, bandwidth often matters more than raw TFLOPS because the GPU reads the full set of model weights from memory for every token it generates. Each token also requires intermediate computations from all previous tokens. Rather than redo that math every time, the GPU stores those results in a KV cache. The longer the conversation, the larger the cache, and it shares the same memory pool as the model weights. So during inference, both capacity and bandwidth are under pressure at the same time.
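A rough upper bound on decode speed follows directly: every generated token reads all weights once, so tokens per second cannot exceed bandwidth divided by weight bytes. The 70B-parameter model with 1-byte (FP8) weights and single-stream generation below are illustrative assumptions:

```python
# Bandwidth roofline for single-stream decoding. Ignores the KV
# cache, batching, and overlap, so real numbers land below these.
params = 70e9
bytes_per_param = 1  # FP8 weights
weights_gb = params * bytes_per_param / 1e9

for gpu, bw_gbs in [("H100", 3350), ("H200", 4800), ("B200", 8000)]:
    print(f"{gpu}: <= {bw_gbs / weights_gb:.0f} tokens/s per stream")
# H100: <= 48, H200: <= 69, B200: <= 114 -- bandwidth-bound, not compute-bound
```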

Memory capacity (GB):

  • V100 (HBM2): 32
  • A100 (HBM2e): 80 (2.5x)
  • H100 (HBM3): 80
  • H200 (HBM3e): 141 (1.8x)
  • B200 (HBM3e): 180 (1.3x)
  • B300 (HBM3e): 288 (1.6x)
  • R200 (HBM4): 288

Memory bandwidth (GB/s):

  • V100 (HBM2): 900
  • A100 (HBM2e): 2,000 (2.2x)
  • H100 (HBM3): 3,350 (1.7x)
  • H200 (HBM3e): 4,800 (1.4x)
  • B200 (HBM3e): 8,000 (1.7x)
  • B300 (HBM3e): 8,000
  • R200 (HBM4): 22,000 (2.8x)

V100's 32 GB of memory limited single-GPU models to roughly 1-2 billion parameters. A100 raised capacity to 80 GB. By late 2023, the bottleneck for large-model inference had shifted from compute to memory bandwidth: H200 exists because of that shift, delivering 43% more bandwidth and 76% more capacity than H100 on the same compute die. B300 pushed to 288 GB using 12-layer HBM3e stacks. A model too large for one GPU must be split across several, which adds communication overhead; more memory per GPU means more of the model fits on each one, reducing that overhead. Rubin's HBM4 is the biggest jump on the bandwidth chart: 22 TB/s, nearly 3x B300. Previous HBM generations increased bandwidth by adding more stacks to the package; HBM4 instead widens each stack's connection to the GPU, moving more data per cycle.

Die size and multi-die packaging

A die is the physical silicon chip inside the GPU package. The nanometer number (12nm, 7nm, 4nm) is the manufacturing process: smaller means more transistors fit in the same area, giving either more compute or lower power per transistor.

Bigger dies pack more transistors, but at a given defect rate per wafer a larger die is more likely to contain a defect, which pushes manufacturing yield down and cost up. Rather than building one enormous chip, NVIDIA put two dies on a single Blackwell package connected by a 10 TB/s chip-to-chip interconnect (NV-HBI) that makes them appear as a single GPU to software. The result: 208 billion transistors, up from Hopper's 80 billion on a single die.
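The economics can be sketched with the textbook Poisson yield model, yield = exp(-D·A). The defect density below is an illustrative assumption, not foundry data:

```python
import math

# Classic Poisson die-yield model: probability a die of area A
# (cm^2) contains no defects at defect density D (defects/cm^2).
defects_per_cm2 = 0.1

for name, area_cm2 in [("one reticle-limited die", 8.0),
                       ("each half-size die", 4.0)]:
    y = math.exp(-defects_per_cm2 * area_cm2)
    print(f"{name}: {y:.0%} yield")
# One big die: ~45%. Each half-size die: ~67%. A defect now scraps
# half as much silicon, so more good area survives per wafer.
```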

Power

TDP (thermal design power) is how many watts the GPU draws under load. Every generation has shipped at higher TDP than the last, and that number determines which data centers can physically host your hardware.

Per-GPU TDP, 8-GPU server draw, and cooling:

  • V100: 300W per GPU, ~5.0 kW per server (air)
  • A100: 400W, ~6.5 kW (air)
  • H100: 700W, ~10.0 kW (air/liquid)
  • H200: 700W, ~10.0 kW (air/liquid)
  • B200: 1,000W, ~13.0 kW (air/liquid)
  • B300: 1,400W, ~15.0 kW (liquid)

Most data centers built before 2020 support 5-10 kW per rack. A single 8-GPU Blackwell server draws around 13 kW, exceeding that on its own. New AI-optimized facilities provision 40-100+ kW per rack with direct liquid cooling to the chip.

Air cooling worked for B200 at 1,000W, including NVIDIA's own DGX B200. Liquid cooling enables a higher 1,200W mode on B200. B300 at 1,400W effectively requires liquid cooling.

Running one 8-GPU B300 server at full load for a year at $0.10/kWh costs roughly $13,000 in power alone, before cooling overhead. A 1,000-GPU cluster's annual power bill can exceed $1 million.
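The power-bill arithmetic, spelled out:

```python
# Annual electricity cost for one 8-GPU B300 server at full load.
server_kw = 15.0          # 8-GPU B300 server draw under load
hours = 24 * 365
price_per_kwh = 0.10

annual = server_kw * hours * price_per_kwh
print(f"${annual:,.0f} per server-year")       # ~$13,140
print(f"${annual * 125:,.0f} for 1,000 GPUs")  # 125 servers, ~$1.6M
```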

NVLink

Training large models means splitting work across GPUs. NVLink is a direct GPU-to-GPU connection that bypasses the CPU, providing much higher bandwidth than the PCIe bus.

NVLink bandwidth (GB/s):

  • V100: 300
  • A100: 600 (2.0x)
  • H100: 900 (1.5x)
  • H200: 900
  • B200: 1,800 (2.0x)
  • B300: 1,800
  • R200: 3,600 (2.0x)

V100's NVLink 2.0 connected up to 8 GPUs in a single server, but clusters scaled poorly beyond that. A100 added NVSwitch, a dedicated chip that lets all 8 GPUs communicate at full bandwidth without CPU involvement. Blackwell's NVLink 5 doubled bandwidth again to 1.8 TB/s per GPU. A full 8-GPU B200 node communicates at 14.4 TB/s aggregate NVLink bandwidth. At rack scale, NVIDIA's NVL72 configurations connect 72 GPUs in a single NVLink domain, meaning every GPU can communicate directly with every other at full bandwidth.
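To see why link bandwidth matters for training, here is an idealized estimate using the textbook ring all-reduce cost, where each GPU moves 2(N-1)/N of the gradient bytes per step. The gradient size is an illustrative assumption, and real collectives add latency:

```python
# Idealized per-step all-reduce time across one 8-GPU node.
grad_gb = 140  # e.g. a 70B-parameter model's BF16 gradients
n_gpus = 8

for gpu, link_gbs in [("A100", 600), ("H100", 900), ("B200", 1800)]:
    traffic = 2 * (n_gpus - 1) / n_gpus * grad_gb  # GB through each link
    print(f"{gpu}: {traffic / link_gbs * 1e3:.0f} ms per all-reduce")
# A100: ~408 ms, H100: ~272 ms, B200: ~136 ms (ideal, zero latency)
```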

Multi-tenancy (MIG)

MIG (Multi-Instance GPU), introduced on A100, splits a single GPU into up to seven hardware-isolated instances, each with dedicated compute, memory, and bandwidth. Cloud providers use MIG to serve multiple inference customers from one GPU, which sharply improved the unit economics of GPU cloud serving. MIG carries forward on H100 and later GPUs.

Form factors

NVIDIA ships data center GPUs in two socket types. SXM mounts the GPU directly on a baseboard called HGX, which holds all the GPUs in a server and carries their NVLink wiring. PCIe is the standard expansion slot found in most servers. SXM delivers higher power limits and full NVLink bandwidth. PCIe versions existed for V100 through H200 at lower TDP. Starting with Blackwell, NVIDIA ships SXM only.

NVIDIA also ships Blackwell as Grace Blackwell, where a custom Arm CPU and two GPUs sit on a single module with a direct 900 GB/s link replacing PCIe.

Every generation at a glance

All specs below are SXM variants with dense TFLOPS.

| Spec | V100 | A100 | H100 | H200 | B200 | B300 | R200 |
|---|---|---|---|---|---|---|---|
| Architecture | Volta | Ampere | Hopper | Hopper | Blackwell | Blackwell | Rubin |
| Year | 2017 | 2020 | 2022 | 2024 | 2025 | 2025 | 2026 |
| BF16 Tensor (TFLOPS) | 125* [4] | 312 [11] | 989 [8] | 989 [12] | 2,250 [13] | ~2,500 | 4,000 [14] |
| FP8 Tensor (TFLOPS) | - | - | 1,979 [8] | 1,979 | 4,500 [13] | 5,000 [15] | 17,500 [14] |
| FP4 Tensor (TFLOPS) | - | - | - | - | 9,000 [13] | 15,000 [15] | 50,000 [16] |
| Memory | 32 GB HBM2 | 80 GB HBM2e | 80 GB HBM3 | 141 GB HBM3e | 180 GB HBM3e | 288 GB HBM3e | 288 GB HBM4 |
| Mem BW | 900 GB/s | 2.0 TB/s | 3.35 TB/s | 4.8 TB/s | 8.0 TB/s | 8.0 TB/s | 22 TB/s |
| NVLink BW | 300 GB/s | 600 GB/s | 900 GB/s | 900 GB/s | 1.8 TB/s | 1.8 TB/s | 3.6 TB/s |
| TDP | 300W | 400W | 700W | 700W | 1,000W | 1,400W | TBD |

*V100 supported FP16, not BF16. The 125 TFLOPS figure is FP16 Tensor Core throughput, listed here for cross-generation comparison.

The most important column depends on your workload. For training, focus on Tensor Core TFLOPS at your target precision and NVLink bandwidth. For inference, memory capacity and memory bandwidth usually matter more than peak compute.

What comes after Blackwell

NVIDIA announced the Rubin platform at CES 2026, with systems available from partners in the second half of 2026. [17] Rubin is not just a GPU upgrade. It is a six-chip platform redesign: the Rubin GPU, Vera CPU, NVLink 6 switch, ConnectX-9 network adapter, BlueField-4 data processing unit, and Spectrum-6 Ethernet switch, all designed together. [16]

Compute and precision. Rubin moves to TSMC 3nm and delivers roughly 50,000 FP4 TFLOPS, about 3.3x B300. [18] Where Blackwell introduced FP4 as an emerging inference format, Rubin is designed around it.

Memory. Rubin is the first generation with HBM4, jumping to 22 TB/s bandwidth, nearly 3x B300's 8 TB/s. As covered above, Blackwell's compute often outpaces its memory feed during inference. HBM4 closes that gap.
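One way to sanity-check that claim with the figures quoted in this article:

```python
# Generational growth in FP4 compute vs memory bandwidth, B300 -> R200.
b300 = {"fp4_tflops": 15_000, "bw_tbs": 8.0}
r200 = {"fp4_tflops": 50_000, "bw_tbs": 22.0}

compute_x = r200["fp4_tflops"] / b300["fp4_tflops"]
bandwidth_x = r200["bw_tbs"] / b300["bw_tbs"]
print(f"compute: {compute_x:.1f}x, bandwidth: {bandwidth_x:.1f}x")
# 3.3x compute vs 2.8x bandwidth: the memory feed nearly keeps pace,
# a much smaller gap than Hopper-to-Blackwell opened up.
```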

Die. Rubin Ultra, planned for 2027, doubles from two compute dies to four per package, with 1 TB of HBM4e. [19]

NVLink. NVLink 6 doubles bandwidth again to 3.6 TB/s per GPU. Rubin ships as the Vera Rubin NVL72 (72 GPUs in a single NVLink 6 domain) and the HGX Rubin NVL8 (8-GPU server board for x86 platforms, same form factor as HGX Blackwell). [17] [18]

References

  1. MLCommons MLPerf Inference v3.1 Results — NVIDIA A100 and H100 GPT-J submissions
  2. MLCommons MLPerf Inference v5.0 Results (April 2025)
  3. NVIDIA Blackwell Ultra Sets New Inference Records in MLPerf v5.1 (September 2025)
  4. NVIDIA V100 Tensor Core GPU Datasheet — https://www.nvidia.com/en-us/data-center/v100/
  5. Ailon et al., "Changing Base Without Losing Pace" (2025) — MatMuls as 80-90% of DNN training/inference compute — https://arxiv.org/abs/2503.12211
  6. Epoch AI, "Training Precision" — precision adoption across 272 notable AI models (2008-2025) — https://epoch.ai/data-insights/training-precision
  7. DeepSeek-V3 Technical Report (2024) — first frontier model trained with native FP8 — https://arxiv.org/abs/2412.19437
  8. NVIDIA H100 Tensor Core GPU Datasheet — https://www.nvidia.com/en-us/data-center/h100/
  9. Brown et al., "Language Models are Few-Shot Learners" (2020) — GPT-3 trained on V100 GPUs — https://arxiv.org/abs/2005.14165
  10. Meta AI, "Introducing Meta Llama 3" (2024) — trained on 24K H100 GPU clusters — https://ai.meta.com/blog/meta-llama-3/
  11. NVIDIA A100 Tensor Core GPU Datasheet — https://www.nvidia.com/en-us/data-center/a100/
  12. NVIDIA H200 Tensor Core GPU Datasheet — https://www.nvidia.com/en-us/data-center/h200/
  13. NVIDIA DGX B200 Datasheet — https://resources.nvidia.com/en-us-dgx-systems/dgx-b200-datasheet
  14. NVIDIA Vera Rubin NVL72 Product Page — per-GPU and system specs — https://www.nvidia.com/en-us/data-center/vera-rubin-nvl72/
  15. NVIDIA Developer Blog — Inside NVIDIA Blackwell Ultra: The Chip Powering the AI Factory Era — https://developer.nvidia.com/blog/inside-nvidia-blackwell-ultra-the-chip-powering-the-ai-factory-era/
  16. NVIDIA Developer Blog — Inside the NVIDIA Rubin Platform: Six New Chips (2026) — https://developer.nvidia.com/blog/inside-the-nvidia-rubin-platform-six-new-chips-one-ai-supercomputer/
  17. NVIDIA Rubin Platform Announcement, CES 2026 — https://nvidianews.nvidia.com/news/rubin-platform-ai-supercomputer
  18. ServeTheHome, "NVIDIA Launches Next-Generation Rubin AI Compute Platform at CES 2026" — https://www.servethehome.com/nvidia-launches-next-generation-rubin-ai-compute-platform-at-ces-2026/
  19. The Next Platform, "Nvidia Draws GPU System Roadmap Out To 2028" (March 2025) — https://www.nextplatform.com/2025/03/19/nvidia-draws-gpu-system-roadmap-out-to-2028/

Frequently Asked Questions

What is the difference between the H100 and B200?

B200 (Blackwell) delivers roughly 2.3x the BF16 Tensor TFLOPS of the H100 (2,250 vs 989), 2.4x the memory bandwidth (8.0 vs 3.35 TB/s), and 2x the NVLink bandwidth (1.8 vs 0.9 TB/s). B200 draws 1,000W compared to H100's 700W and can require liquid cooling at full power.

What is the Transformer Engine on NVIDIA GPUs?

The Transformer Engine, introduced on H100, is hardware that automatically switches between FP8 and FP16 during transformer layer computations. It monitors numerical ranges per layer and picks the lowest precision that preserves accuracy, giving near-FP8 throughput with FP16 training quality. Blackwell's second-generation Transformer Engine extends this to FP4 for inference.

Why does memory bandwidth matter more than TFLOPS for inference?

For inference, the GPU reads the full set of model weights from memory for every token it generates. Each token also requires intermediate computations from all previous tokens, stored in a KV cache that shares the same memory pool as the model weights. Both capacity and bandwidth are under pressure at the same time. H200 exists because of this shift, delivering 43% more bandwidth (4.8 vs 3.35 TB/s) and 76% more capacity than H100 on the same compute die.

What comes after Blackwell?

NVIDIA announced the Rubin platform at CES 2026, with systems available from partners in the second half of 2026. Rubin moves to TSMC 3nm and delivers roughly 50,000 FP4 TFLOPS, about 3.3x B300. It is the first generation with HBM4, jumping to 22 TB/s bandwidth, nearly 3x B300's 8 TB/s.
