When a GPU Dies in Production
After detecting a GPU failure, you file a claim with your supplier and wait for a replacement to arrive. That wait can be as short as the next business day if your contract includes advance replacement, or 3-14 business days if it does not. Once the replacement arrives, a technician swaps the card (2-4 hours), runs diagnostics, and returns the node to production. If you keep spares on-site, the full cycle takes under a day.
Do you need to worry about GPU failures
A single GPU is fairly reliable on its own. On average, individual H100s only fail every few tens of thousands of hours. [1]Epoch AI, "Hardware Failures Won't Limit AI Scaling" (2024)https://epoch.ai/blog/hardware-failures-wont-limit-ai-scaling Replacing a single GPU is a rare event, at least for smaller AI clusters.
But even if GPU failures are rare, they are still the most common type of hardware interruption. Meta trained Llama 3 405B on 16,384 H100 GPUs over 54 days and recorded 419 unexpected hardware interruptions, or one failure every three hours. [2]Dubey et al., "The Llama 3 Herd of Models," Meta (2024)https://arxiv.org/abs/2407.21783
How to detect a failing GPU
NVIDIA's driver outputs XID errors when a GPU fails, and NVIDIA documents each error code. For example, XID 48 means the GPU has an "uncorrectable double-bit ECC error in memory" [3]NVIDIA, "XID Errors" (accessed March 2026)https://docs.nvidia.com/deploy/xid-errors/ and generally means the GPU must be replaced. [4]NVIDIA, "Dynamic Page Retirement" (accessed March 2026)https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html
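XID errors surface in the kernel log as lines like `NVRM: Xid (PCI:0000:3b:00): 48, ...`. A minimal sketch of log-based detection, assuming that line format and using an illustrative (not exhaustive) set of replace-the-GPU codes:

```python
import re

# XID codes that generally indicate the GPU itself should be replaced.
# This subset is illustrative; consult NVIDIA's XID documentation for
# the authoritative meaning of each code.
FATAL_XIDS = {48, 79}

# Assumed kernel-log line shape: "NVRM: Xid (PCI:0000:3b:00): 48, ..."
XID_PATTERN = re.compile(r"NVRM: Xid \((PCI:[0-9a-f:.]+)\): (\d+)")

def scan_kernel_log(lines):
    """Return (pci_address, xid_code) pairs found in kernel log lines."""
    hits = []
    for line in lines:
        m = XID_PATTERN.search(line)
        if m:
            hits.append((m.group(1), int(m.group(2))))
    return hits

def needs_replacement(hits):
    """True if any observed XID is in the fatal set."""
    return any(code in FATAL_XIDS for _, code in hits)
```

In practice you would feed this from `dmesg` or journald and page an operator when `needs_replacement` fires.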
DCGM, NVIDIA's Data Center GPU Manager, also helps with ongoing health monitoring. It runs tiered diagnostics from Level 1 (basic readiness checks) through Level 4 (full analysis covering interfaces, memory, thermal and power constraints). [5]NVIDIA, "DCGM Diagnostics" (accessed March 2026)https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html
The most frustrating situation is when there are no errors. Silent data corruption (SDC) is when the GPU computes wrong results without logging any error. Meta reported 6 SDC incidents during their 54-day Llama 3 training run. [2]Dubey et al., "The Llama 3 Herd of Models," Meta (2024)https://arxiv.org/abs/2407.21783 Research on unhealthy nodes has shown that SDC can cause training loss spikes and push models toward different optima, even when the per-step perturbations appear small. [6]Ma et al., "Understanding Silent Data Corruption in LLM Training" (2025)https://arxiv.org/abs/2502.12340
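Because SDC produces no error signal, detection has to come from redundancy. One common idea is that data-parallel replicas hold bit-identical tensors after an all-reduce, so a replica whose checksum disagrees with the majority is suspect. A minimal sketch of that cross-replica comparison (the fingerprinting scheme here is an illustration, not any framework's built-in API):

```python
import hashlib
import struct

def fingerprint(values):
    """Deterministic checksum of a list of floats, e.g. a gradient shard."""
    h = hashlib.sha256()
    for v in values:
        h.update(struct.pack("<d", v))
    return h.hexdigest()[:16]

def detect_divergence(replica_values):
    """Given the same tensor gathered from data-parallel replicas (which
    should be bit-identical after all-reduce), return the indices of
    replicas whose checksum differs from the majority."""
    prints = [fingerprint(v) for v in replica_values]
    majority = max(set(prints), key=prints.count)
    return [i for i, p in enumerate(prints) if p != majority]
```

A corrupted replica shows up as a minority checksum even when the numeric difference is tiny, which is exactly the case that loss curves alone can miss.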
Early warning signs [7]NVIDIA, "GPU Debug Guidelines" (accessed March 2026)https://docs.nvidia.com/deploy/gpu-debug-guidelines/index.html that a GPU is degrading before it fails outright:
- Correctable ECC error counts rising over time
- Thermal throttling events under normal load
- PCIe or NVLink bandwidth dropping below baseline
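The first warning sign is a trend, not a single value: the cumulative correctable ECC counter keeps climbing. A minimal sketch of trend detection over polled counter samples (the window and threshold are arbitrary illustrative choices, not NVIDIA guidance):

```python
def is_degrading(ecc_counts, window=5, threshold=1):
    """Flag a GPU whose correctable ECC count keeps rising.

    ecc_counts: time-ordered samples of the cumulative correctable ECC
    counter, e.g. polled hourly via nvidia-smi or DCGM. A counter that
    increases in most recent intervals suggests failing HBM even though
    each individual error was corrected.
    """
    recent = ecc_counts[-(window + 1):]
    increases = sum(1 for a, b in zip(recent, recent[1:]) if b - a >= threshold)
    # Rising in a majority of recent intervals counts as degrading.
    return increases >= window // 2 + 1
```

The same shape of check works for throttle-event counts or measured bandwidth (inverted: falling instead of rising).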
How to protect against GPU failure
Warranties and RMA
NVIDIA typically offers a standard warranty of 1-3 years for datacenter GPUs. [8]Based on industry experience and knowledge of NVIDIA datacenter GPU warranty programs The warranty includes RMA (Return Merchandise Authorization), a process where you ship the defective GPU back to the manufacturer and they ship you a replacement. [9]NVIDIA, "RMA Process" (accessed March 2026)https://docs.nvidia.com/deploy/rma-process/index.html NVIDIA's expedited RMA option typically takes 3-5 business days. Standard RMA typically takes 7-14 business days.
OEMs (Original Equipment Manufacturers) such as Dell, HPE, Lenovo, and Supermicro integrate NVIDIA GPUs into complete servers, and they also offer support services for exactly these situations. [10]NVIDIA, "Enterprise Support Services" (accessed March 2026)https://www.nvidia.com/en-us/support/enterprise/ A common offering is advance replacement with a next-business-day SLA. Supermicro's RMA portal is one example. [11]Supermicro, "RMA" (accessed March 2026)https://www.supermicro.com/en/support/rma
| RMA path | Typical timeline |
|---|---|
| NVIDIA standard RMA | 7-14 business days |
| NVIDIA expedited RMA | 3-5 business days |
| OEM advance replacement | Next business day |
| Self-service with on-site spares | 2-6 hours (swap + initial validation) |
Spare GPU strategy
Keeping cold spares on-site is the fastest path back to production. Buying spares in advance feels expensive, but emergency procurement costs significantly more, and every day of downtime adds to that cost. Your OEM or value-added reseller (VAR) will offer spare and replacement services under names like "3 Year Premier Next-Business-Day Response" or "ProSupport with Next-Business-Day Onsite Service." The right strategy depends on your workload type, tolerance for downtime, and cluster size.
Spares are only worthwhile if you have the team and resources to swap them in quickly, or your supplier offers a service to do it for you. PCIe GPUs are standard cards that slide into a PCIe x16 slot, but SXM GPUs require more involved replacement. Lenovo's service manual for SXM5 systems documents a multi-step procedure: powering down, removing the GPU tray assembly, cleaning and reapplying thermal paste, seating the replacement module onto the baseboard with a torque screwdriver, and reassembling. [12]Lenovo, "Install an SXM5 GPU, ThinkSystem SR675 V3" (accessed March 2026)https://pubs.lenovo.com/sr675-v3/install_an_sxm_gpu This is 2-4 hours of hands-on work.
Checkpointing for training
Checkpointing saves training progress at regular intervals. When a failure occurs, you restart from the most recent checkpoint instead of from the beginning. Many training frameworks, like PyTorch, support checkpointing. [13]PyTorch, "Reducing Checkpointing Times by Over 10x with Distributed Asynchronous Checkpointing" (2024)https://pytorch.org/blog/reducing-checkpointing-times/
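The core of any checkpointing scheme is the same regardless of framework: write checkpoints atomically so a crash mid-write cannot corrupt them, and resume from the latest one on restart. A minimal pure-Python sketch of that resume logic (a stand-in for `torch.save`-style checkpointing; the training step here is a placeholder):

```python
import os
import pickle
import tempfile

def save_checkpoint(path, step, state):
    """Atomically write a checkpoint: write to a temp file, then rename."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)  # atomic on POSIX: readers see old or new, never partial

def load_checkpoint(path):
    """Return (step, state) from the latest checkpoint, or a fresh start."""
    if not os.path.exists(path):
        return 0, None
    with open(path, "rb") as f:
        ckpt = pickle.load(f)
    return ckpt["step"], ckpt["state"]

def train(path, total_steps, checkpoint_every=100):
    """Resume from the last checkpoint and run to total_steps."""
    step, state = load_checkpoint(path)
    state = state if state is not None else {"loss": float("inf")}
    while step < total_steps:
        step += 1
        state["loss"] = 1.0 / step  # placeholder for a real training step
        if step % checkpoint_every == 0:
            save_checkpoint(path, step, state)
    return step, state
```

After a failure, rerunning `train` loses at most `checkpoint_every` steps of work, which is the basic trade-off: checkpoint more often to lose less, at the cost of I/O overhead per checkpoint.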
Automatic failover for inference
Inference services typically run multiple replicas of the same model behind a load balancer. When a GPU fails, the load balancer routes requests to the remaining healthy replicas. The failed node gets replaced without any downtime visible to end users. NVIDIA's Dynamo framework takes this further by migrating in-progress requests from a failing node to a healthy one mid-inference. [14]NVIDIA, "Dynamo Fault Tolerance" (accessed March 2026)https://docs.nvidia.com/dynamo/latest/user-guides/fault-tolerance
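The failover behavior described above can be sketched as a toy round-robin router that marks a replica unhealthy when a call fails and retries on the next one. This is an illustration of the pattern only; real load balancers use active health checks and connection draining rather than failing on live requests:

```python
import itertools

class Replica:
    """A stand-in for one model server behind the load balancer."""
    def __init__(self, name):
        self.name = name
        self.healthy = True

    def infer(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name}:{request}"

def route(replicas, request):
    """Round-robin over replicas, skipping and marking failed ones."""
    pool = itertools.cycle(replicas)
    for _ in range(len(replicas)):
        r = next(pool)
        if not r.healthy:
            continue
        try:
            return r.infer(request)
        except ConnectionError:
            r.healthy = False  # health check would normally do this
    raise RuntimeError("no healthy replicas")
```

From the client's perspective a GPU failure is invisible as long as at least one replica stays healthy; the failed node is replaced out of band.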
How long replacement takes
With on-site spares and automated health monitoring: detection in minutes, a technician swaps the GPU in 2-4 hours, and validation takes another 1-2 hours for DCGM diagnostics. Back in production the same day.
Without spares, standard NVIDIA RMA takes 7-14 business days from when they receive the defective GPU. With a qualifying support contract, OEM advance replacement ships the next business day.
One detail that delays RMA if skipped: NVIDIA requires a diagnostic log from nvidia-bug-report.sh with every RMA request. [9]NVIDIA, "RMA Process" (accessed March 2026)https://docs.nvidia.com/deploy/rma-process/index.html
How to validate a replacement GPU
The replacement GPU needs to be validated before the node is returned to production. The following checklist is based on NVIDIA's GPU Debug Guidelines [7]NVIDIA, "GPU Debug Guidelines" (accessed March 2026)https://docs.nvidia.com/deploy/gpu-debug-guidelines/index.html and DCGM Diagnostics [5]NVIDIA, "DCGM Diagnostics" (accessed March 2026)https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html documentation.
- BMC verification. Confirm through the server's Redfish interface [15]NVIDIA, "DGX H100 Redfish API Support" (accessed March 2026)https://docs.nvidia.com/dgx/dgxh100-user-guide/redfish-api-supp.html that the GPU shows a healthy state, correct model identification, matching firmware version, and full PCIe lane count (16/16).
- DCGM diagnostics. Run Level 3 or Level 4. Tests interfaces, memory, thermal response, and power draw. Takes 1-2 hours. [5]NVIDIA, "DCGM Diagnostics" (accessed March 2026)https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html
- Memory check. Use DCGM framebuffer diagnostics to look for pending page retirements. [4]NVIDIA, "Dynamic Page Retirement" (accessed March 2026)https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html A replacement GPU with pending retirements is suspect.
- NVLink topology. Run nvidia-smi nvlink --status [16]NVIDIA, "nvidia-smi Documentation" (accessed March 2026)https://docs.nvidia.com/deploy/nvidia-smi/index.html to confirm all links are active and at rated bandwidth.
- NCCL bandwidth. Run all_reduce_perf from nccl-tests [17]NVIDIA, "NCCL Tests" (accessed March 2026)https://github.com/NVIDIA/nccl-tests to measure actual GPU-to-GPU communication throughput. Compare against the known baseline for your topology.
- XID monitoring. Watch for any XID errors [3]NVIDIA, "XID Errors" (accessed March 2026)https://docs.nvidia.com/deploy/xid-errors/ during diagnostics and testing. Any XID 48 or XID 79 means the replacement GPU is itself defective.
Only return the node to production workloads after every step passes.
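The gate at the end of the checklist is easy to encode: the node returns to production only if every check passed. A minimal sketch, where the step names and the tooling that fills in the results (dcgmi, nvidia-smi, nccl-tests) are assumptions for illustration:

```python
# Hypothetical step names mirroring the validation checklist above.
VALIDATION_STEPS = [
    "bmc_healthy",             # Redfish: model, firmware, 16/16 PCIe lanes
    "dcgm_diag_passed",        # DCGM Level 3 or 4 diagnostics
    "no_pending_retirements",  # framebuffer / page retirement check
    "nvlink_all_active",       # nvidia-smi nvlink --status
    "nccl_bandwidth_ok",       # all_reduce_perf vs. known baseline
    "no_xid_errors",           # kernel log clean during testing
]

def ready_for_production(results):
    """Return (ok, failed_steps).

    results maps each step name to True/False, filled in by whatever
    tooling runs the actual checks. A step that was never run counts
    as failed: absence of evidence is not a pass.
    """
    failed = [s for s in VALIDATION_STEPS if not results.get(s, False)]
    return (len(failed) == 0, failed)
```

Treating a missing result as a failure is the important design choice: a skipped check should block the node, not silently pass it.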
How often do GPUs fail
Epoch AI modeled failure rates across cluster sizes using a per-GPU mean time between failures of roughly 50,000 hours, about 6 years: [1]Epoch AI, "Hardware Failures Won't Limit AI Scaling" (2024)https://epoch.ai/blog/hardware-failures-wont-limit-ai-scaling
| Cluster size | Approximate failure interval |
|---|---|
| 1 GPU | ~6 years |
| 1,000 GPUs | ~2 days |
| 16,384 GPUs (Meta Llama 3) | ~3 hours |
| 100,000 GPUs | ~30 minutes |
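The table follows from one line of arithmetic: assuming independent failures, the cluster's expected time between failures is the per-GPU MTBF divided by the GPU count.

```python
def cluster_failure_interval_hours(per_gpu_mtbf_hours, n_gpus):
    """Expected time between failures for a cluster of n independent GPUs.

    With independent, exponentially distributed failures, the failure
    rates add, so the cluster-wide MTBF is the per-GPU MTBF / n_gpus.
    """
    return per_gpu_mtbf_hours / n_gpus
```

At 50,000 hours per GPU this gives roughly 50 hours (~2 days) for 1,000 GPUs, ~3 hours for 16,384, and 30 minutes for 100,000, matching the table.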
Meta still achieved over 90% effective training time despite the failure rate. [2]Dubey et al., "The Llama 3 Herd of Models," Meta (2024)https://arxiv.org/abs/2407.21783 Only 3 of the 419 incidents required significant manual intervention. Automated detection, job restart, and checkpoint recovery handled the rest.
What causes GPU failures
HBM, the memory stacked directly on top of the GPU die, is a common source of failures that require GPU replacement. [2]Dubey et al., "The Llama 3 Herd of Models," Meta (2024)https://arxiv.org/abs/2407.21783 ECC corrects single-bit errors automatically, but uncorrectable double-bit errors mean the GPU needs to be swapped.
A broken NVLink stalls the entire training job because the collective blocks until every participant completes. [18]NVIDIA, "Multi-Node NVLink Troubleshooting Guide" (accessed March 2026)https://docs.nvidia.com/multi-node-nvlink-systems/nvdebug-guide/troubleshooting_guide.html Firmware mismatches between GPUs are a common trigger.
Without sufficient power and cooling, GPUs throttle: the clock speed drops and performance degrades without a clean error signal. [19]NVIDIA, "NVML API Reference: Clock Throttle Reasons" (accessed March 2026)https://docs.nvidia.com/deploy/nvml-api/group__nvmlClocksThrottleReasons.html
References
- Epoch AI, "Hardware Failures Won't Limit AI Scaling" (2024)
- Dubey et al., "The Llama 3 Herd of Models," Meta (2024)
- NVIDIA, "XID Errors" (accessed March 2026)
- NVIDIA, "Dynamic Page Retirement" (accessed March 2026)
- NVIDIA, "DCGM Diagnostics" (accessed March 2026)
- Ma et al., "Understanding Silent Data Corruption in LLM Training" (2025)
- NVIDIA, "GPU Debug Guidelines" (accessed March 2026)
- Based on industry experience and knowledge of NVIDIA datacenter GPU warranty programs
- NVIDIA, "RMA Process" (accessed March 2026)
- NVIDIA, "Enterprise Support Services" (accessed March 2026)
- Supermicro, "RMA" (accessed March 2026)
- Lenovo, "Install an SXM5 GPU, ThinkSystem SR675 V3" (accessed March 2026)
- PyTorch, "Reducing Checkpointing Times by Over 10x with Distributed Asynchronous Checkpointing" (2024)
- NVIDIA, "Dynamo Fault Tolerance" (accessed March 2026)
- NVIDIA, "DGX H100 Redfish API Support" (accessed March 2026)
- NVIDIA, "nvidia-smi Documentation" (accessed March 2026)
- NVIDIA, "NCCL Tests" (accessed March 2026)
- NVIDIA, "Multi-Node NVLink Troubleshooting Guide" (accessed March 2026)
- NVIDIA, "NVML API Reference: Clock Throttle Reasons" (accessed March 2026)
Frequently Asked Questions
How often do GPUs fail in a large cluster?
A single GPU has a mean time between failures of roughly 50,000 hours, about 6 years. At scale, failures are frequent. Meta trained Llama 3 405B on 16,384 H100 GPUs over 54 days and recorded 419 unexpected hardware interruptions, roughly one every three hours. GPU hardware faults were the most common category at 184 of the 419 incidents.
How long does it take to replace a failed GPU?
With on-site spares: detection in minutes, a technician swaps the GPU in 2-4 hours, and validation takes 1-2 hours. Back in production the same day. Without spares, NVIDIA standard RMA takes 7-14 business days. OEM advance replacement ships next business day with a qualifying support contract.
What is silent data corruption in GPU training?
Silent data corruption (SDC) is when the GPU computes wrong results without logging any error. Meta reported 6 SDC incidents during their 54-day Llama 3 training run on 16,384 H100s. Research has shown that SDC can cause training loss spikes and push models toward different optima, even when the per-step perturbations appear small.
What are the early warning signs of a GPU failure?
Three signs of a GPU degrading before it fails outright: correctable ECC error counts rising over time, thermal throttling events under normal load, and bandwidth speeds slowing down. NVIDIA's DCGM (Data Center GPU Manager) runs tiered diagnostics from Level 1 through Level 4 to detect these problems.