When a GPU Dies in Production
After detecting a GPU failure, you file a claim with your supplier and wait for a replacement to arrive. That wait can be as short as the next business day if your contract includes advance replacement, or 3-14 business days if it does not. Once the replacement arrives, a technician swaps the card (2-4 hours), runs diagnostics, and returns the node to production. If you keep spares on-site, the full cycle takes under a day.
Do you need to worry about GPU failures
A single GPU is fairly reliable on its own. On average, individual H100s only fail every few tens of thousands of hours. [1]Epoch AI, "Hardware Failures Won't Limit AI Scaling" (2024)https://epoch.ai/blog/hardware-failures-wont-limit-ai-scaling Replacing a single GPU is a rare event, at least for smaller AI clusters.
But even if GPU failures are rare, they are still the most common type of hardware interruption. Meta trained Llama 3 405B on 16,384 H100 GPUs over 54 days and recorded 419 unexpected hardware interruptions, or one failure every three hours. [2]Dubey et al., "The Llama 3 Herd of Models," Meta (2024)https://arxiv.org/abs/2407.21783
How to detect a failing GPU
NVIDIA's driver outputs XID errors when a GPU fails, and NVIDIA documents each error code. For example, XID 48 means the GPU has an "uncorrectable double-bit ECC error in memory" [3]NVIDIA, "XID Errors" (accessed March 2026)https://docs.nvidia.com/deploy/xid-errors/ and generally means the GPU must be replaced. [4]NVIDIA, "Dynamic Page Retirement" (accessed March 2026)https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html
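XID errors surface in the kernel log as lines like `NVRM: Xid (PCI:0000:3b:00): 48, ...`. A minimal sketch of log-based detection, assuming that line format and using an illustrative (not exhaustive) set of replace-the-GPU codes:

```python
import re

# XID codes that generally indicate the GPU itself should be replaced.
# This subset is illustrative; consult NVIDIA's XID documentation for
# the authoritative meaning of each code.
FATAL_XIDS = {48, 79}

# Assumed kernel-log line shape: "NVRM: Xid (PCI:0000:3b:00): 48, ..."
XID_PATTERN = re.compile(r"NVRM: Xid \((PCI:[0-9a-f:.]+)\): (\d+)")

def scan_kernel_log(lines):
    """Return (pci_address, xid_code) pairs found in kernel log lines."""
    hits = []
    for line in lines:
        m = XID_PATTERN.search(line)
        if m:
            hits.append((m.group(1), int(m.group(2))))
    return hits

def needs_replacement(hits):
    """True if any observed XID is in the fatal set."""
    return any(code in FATAL_XIDS for _, code in hits)
```

In practice you would feed this from `dmesg` or journald and page an operator when `needs_replacement` fires.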
DCGM, NVIDIA's Data Center GPU Manager, also helps with ongoing health monitoring. It runs tiered diagnostics from Level 1 (basic readiness checks) through Level 4 (full analysis covering interfaces, memory, thermal and power constraints). [5]NVIDIA, "DCGM Diagnostics" (accessed March 2026)https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html
The most frustrating situation is when there are no errors. Silent data corruption (SDC) is when the GPU computes wrong results without logging any error. Meta reported 6 SDC incidents during their 54-day Llama 3 training run. [2]Dubey et al., "The Llama 3 Herd of Models," Meta (2024)https://arxiv.org/abs/2407.21783 Research on unhealthy nodes has shown that SDC can cause training loss spikes and push models toward different optima, even when the per-step perturbations appear small. [6]Ma et al., "Understanding Silent Data Corruption in LLM Training" (2025)https://arxiv.org/abs/2502.12340
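Because SDC produces no error signal, detection has to come from redundancy. One common idea is that data-parallel replicas hold bit-identical tensors after an all-reduce, so a replica whose checksum disagrees with the majority is suspect. A minimal sketch of that cross-replica comparison (the fingerprinting scheme here is an illustration, not any framework's built-in API):

```python
import hashlib
import struct

def fingerprint(values):
    """Deterministic checksum of a list of floats, e.g. a gradient shard."""
    h = hashlib.sha256()
    for v in values:
        h.update(struct.pack("<d", v))
    return h.hexdigest()[:16]

def detect_divergence(replica_values):
    """Given the same tensor gathered from data-parallel replicas (which
    should be bit-identical after all-reduce), return the indices of
    replicas whose checksum differs from the majority."""
    prints = [fingerprint(v) for v in replica_values]
    majority = max(set(prints), key=prints.count)
    return [i for i, p in enumerate(prints) if p != majority]
```

A corrupted replica shows up as a minority checksum even when the numeric difference is tiny, which is exactly the case that loss curves alone can miss.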
Early warning signs [7]NVIDIA, "GPU Debug Guidelines" (accessed March 2026)https://docs.nvidia.com/deploy/gpu-debug-guidelines/index.html that a GPU is degrading before it fails outright:
- Correctable ECC error counts rising over time
- Thermal throttling events under normal load
- PCIe or NVLink bandwidth dropping below baseline
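The first warning sign is a trend, not a single value: the cumulative correctable ECC counter keeps climbing. A minimal sketch of trend detection over polled counter samples (the window and threshold are arbitrary illustrative choices, not NVIDIA guidance):

```python
def is_degrading(ecc_counts, window=5, threshold=1):
    """Flag a GPU whose correctable ECC count keeps rising.

    ecc_counts: time-ordered samples of the cumulative correctable ECC
    counter, e.g. polled hourly via nvidia-smi or DCGM. A counter that
    increases in most recent intervals suggests failing HBM even though
    each individual error was corrected.
    """
    recent = ecc_counts[-(window + 1):]
    increases = sum(1 for a, b in zip(recent, recent[1:]) if b - a >= threshold)
    # Rising in a majority of recent intervals counts as degrading.
    return increases >= window // 2 + 1
```

The same shape of check works for throttle-event counts or measured bandwidth (inverted: falling instead of rising).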
How to protect against GPU failure
Warranties and RMA
NVIDIA typically offers a standard warranty of 1-3 years for datacenter GPUs. [8]Based on industry experience and knowledge of NVIDIA datacenter GPU warranty programs The warranty includes RMA (Return Merchandise Authorization), a process where you ship the defective GPU back to the manufacturer and they ship you a replacement. [9]NVIDIA, "RMA Process" (accessed March 2026)https://docs.nvidia.com/deploy/rma-process/index.html NVIDIA's expedited RMA option typically takes 3-5 business days. Standard RMA typically takes 7-14 business days.
OEMs (Original Equipment Manufacturers) such as Dell, HPE, Lenovo, and Supermicro integrate NVIDIA GPUs into complete servers, and they also offer support services for exactly these situations. [10]NVIDIA, "Enterprise Support Services" (accessed March 2026)https://www.nvidia.com/en-us/support/enterprise/ A common offering is advance replacement with a next-business-day SLA. Supermicro's RMA portal is one example. [11]Supermicro, "RMA" (accessed March 2026)https://www.supermicro.com/en/support/rma
| RMA path | Typical timeline |
|---|---|
| NVIDIA standard RMA | 7-14 business days |
| NVIDIA expedited RMA | 3-5 business days |
| OEM advance replacement | Next business day |
| Self-service with on-site spares | 2-6 hours (swap + initial validation) |
Spare GPU strategy
Keeping cold spares on-site is the fastest path back to production. Buying spares in advance feels expensive, but emergency procurement costs significantly more, and every day of downtime adds to that cost. Your OEM or value-added reseller (VAR) will offer spare and replacement services under names like "3 Year Premier Next-Business-Day Response" or "ProSupport with Next-Business-Day Onsite Service." The right strategy depends on your workload type, tolerance for downtime, and cluster size.
Spares are only worthwhile if you have the team and resources to swap them in quickly, or your supplier offers a service to do it for you. PCIe GPUs are standard cards that slide into a PCIe x16 slot, but SXM GPUs require more involved replacement. Lenovo's service manual for SXM5 systems documents a multi-step procedure: powering down, removing the GPU tray assembly, cleaning and reapplying thermal paste, seating the replacement module onto the baseboard with a torque screwdriver, and reassembling. [12]Lenovo, "Install an SXM5 GPU, ThinkSystem SR675 V3" (accessed March 2026)https://pubs.lenovo.com/sr675-v3/install_an_sxm_gpu This is 2-4 hours of hands-on work.
Checkpointing for training
Checkpointing saves training progress at regular intervals. When a failure occurs, you restart from the most recent checkpoint instead of from the beginning. Many training frameworks, like PyTorch, support checkpointing. [13]PyTorch, "Reducing Checkpointing Times by Over 10x with Distributed Asynchronous Checkpointing" (2024)https://pytorch.org/blog/reducing-checkpointing-times/
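The core of any checkpointing scheme is the same regardless of framework: write checkpoints atomically so a crash mid-write cannot corrupt them, and resume from the latest one on restart. A minimal pure-Python sketch of that resume logic (a stand-in for `torch.save`-style checkpointing; the training step here is a placeholder):

```python
import os
import pickle
import tempfile

def save_checkpoint(path, step, state):
    """Atomically write a checkpoint: write to a temp file, then rename."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)  # atomic on POSIX: readers see old or new, never partial

def load_checkpoint(path):
    """Return (step, state) from the latest checkpoint, or a fresh start."""
    if not os.path.exists(path):
        return 0, None
    with open(path, "rb") as f:
        ckpt = pickle.load(f)
    return ckpt["step"], ckpt["state"]

def train(path, total_steps, checkpoint_every=100):
    """Resume from the last checkpoint and run to total_steps."""
    step, state = load_checkpoint(path)
    state = state if state is not None else {"loss": float("inf")}
    while step < total_steps:
        step += 1
        state["loss"] = 1.0 / step  # placeholder for a real training step
        if step % checkpoint_every == 0:
            save_checkpoint(path, step, state)
    return step, state
```

After a failure, rerunning `train` loses at most `checkpoint_every` steps of work, which is the basic trade-off: checkpoint more often to lose less, at the cost of I/O overhead per checkpoint.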
Automatic failover for inference
Inference services typically run multiple replicas of the same model behind a load balancer. When a GPU fails, the load balancer routes requests to the remaining healthy replicas. The failed node gets replaced without any downtime visible to end users. NVIDIA's Dynamo framework takes this further by migrating in-progress requests from a failing node to a healthy one mid-inference. [14]NVIDIA, "Dynamo Fault Tolerance" (accessed March 2026)https://docs.nvidia.com/dynamo/latest/user-guides/fault-tolerance
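The failover behavior described above can be sketched as a toy round-robin router that marks a replica unhealthy when a call fails and retries on the next one. This is an illustration of the pattern only; real load balancers use active health checks and connection draining rather than failing on live requests:

```python
import itertools

class Replica:
    """A stand-in for one model server behind the load balancer."""
    def __init__(self, name):
        self.name = name
        self.healthy = True

    def infer(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name}:{request}"

def route(replicas, request):
    """Round-robin over replicas, skipping and marking failed ones."""
    pool = itertools.cycle(replicas)
    for _ in range(len(replicas)):
        r = next(pool)
        if not r.healthy:
            continue
        try:
            return r.infer(request)
        except ConnectionError:
            r.healthy = False  # health check would normally do this
    raise RuntimeError("no healthy replicas")
```

From the client's perspective a GPU failure is invisible as long as at least one replica stays healthy; the failed node is replaced out of band.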
How long replacement takes
With on-site spares and automated health monitoring: detection in minutes, a technician swaps the GPU in 2-4 hours, and validation takes another 1-2 hours for DCGM diagnostics. Back in production the same day.
Without spares, standard NVIDIA RMA takes 7-14 business days from when they receive the defective GPU. With a qualifying support contract, OEM advance replacement ships the next business day.
One detail that delays RMA if skipped: NVIDIA requires a diagnostic log from nvidia-bug-report.sh with every RMA request. [9]NVIDIA, "RMA Process" (accessed March 2026)https://docs.nvidia.com/deploy/rma-process/index.html
How to validate a replacement GPU
The replacement GPU needs to be validated before the node is returned to production. The following checklist is based on NVIDIA's GPU Debug Guidelines [7]NVIDIA, "GPU Debug Guidelines" (accessed March 2026)https://docs.nvidia.com/deploy/gpu-debug-guidelines/index.html and DCGM Diagnostics [5]NVIDIA, "DCGM Diagnostics" (accessed March 2026)https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html documentation.
- BMC verification. Confirm through the server's Redfish interface [15]NVIDIA, "DGX H100 Redfish API Support" (accessed March 2026)https://docs.nvidia.com/dgx/dgxh100-user-guide/redfish-api-supp.html that the GPU shows a healthy state, correct model identification, matching firmware version, and full PCIe lane count (16/16).
- DCGM diagnostics. Run Level 3 or Level 4. Tests interfaces, memory, thermal response, and power draw. Takes 1-2 hours. [5]NVIDIA, "DCGM Diagnostics" (accessed March 2026)https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html
- Memory check. Use DCGM framebuffer diagnostics to look for pending page retirements. [4]NVIDIA, "Dynamic Page Retirement" (accessed March 2026)https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html A replacement GPU with pending retirements is suspect.
- NVLink topology. Run nvidia-smi nvlink --status [16]NVIDIA, "nvidia-smi Documentation" (accessed March 2026)https://docs.nvidia.com/deploy/nvidia-smi/index.html to confirm all links are active and at rated bandwidth.
- NCCL bandwidth. Run all_reduce_perf from nccl-tests [17]NVIDIA, "NCCL Tests" (accessed March 2026)https://github.com/NVIDIA/nccl-tests to measure actual GPU-to-GPU communication throughput. Compare against the known baseline for your topology.
- XID monitoring. Watch for any XID errors [3]NVIDIA, "XID Errors" (accessed March 2026)https://docs.nvidia.com/deploy/xid-errors/ during diagnostics and testing. Any XID 48 or XID 79 means the replacement GPU is itself defective.
Only return the node to production workloads after every step passes.
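The gate at the end of the checklist is easy to encode: the node returns to production only if every check passed. A minimal sketch, where the step names and the tooling that fills in the results (dcgmi, nvidia-smi, nccl-tests) are assumptions for illustration:

```python
# Hypothetical step names mirroring the validation checklist above.
VALIDATION_STEPS = [
    "bmc_healthy",             # Redfish: model, firmware, 16/16 PCIe lanes
    "dcgm_diag_passed",        # DCGM Level 3 or 4 diagnostics
    "no_pending_retirements",  # framebuffer / page retirement check
    "nvlink_all_active",       # nvidia-smi nvlink --status
    "nccl_bandwidth_ok",       # all_reduce_perf vs. known baseline
    "no_xid_errors",           # kernel log clean during testing
]

def ready_for_production(results):
    """Return (ok, failed_steps).

    results maps each step name to True/False, filled in by whatever
    tooling runs the actual checks. A step that was never run counts
    as failed: absence of evidence is not a pass.
    """
    failed = [s for s in VALIDATION_STEPS if not results.get(s, False)]
    return (len(failed) == 0, failed)
```

Treating a missing result as a failure is the important design choice: a skipped check should block the node, not silently pass it.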
How often do GPUs fail
Epoch AI modeled failure rates across cluster sizes using a per-GPU mean time between failures of roughly 50,000 hours, about 6 years: [1]Epoch AI, "Hardware Failures Won't Limit AI Scaling" (2024)https://epoch.ai/blog/hardware-failures-wont-limit-ai-scaling
| Cluster size | Approximate failure interval |
|---|---|
| 1 GPU | ~6 years |
| 1,000 GPUs | ~2 days |
| 16,384 GPUs (Meta Llama 3) | ~3 hours |
| 100,000 GPUs | ~30 minutes |
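The table follows from one line of arithmetic: assuming independent failures, the cluster's expected time between failures is the per-GPU MTBF divided by the GPU count.

```python
def cluster_failure_interval_hours(per_gpu_mtbf_hours, n_gpus):
    """Expected time between failures for a cluster of n independent GPUs.

    With independent, exponentially distributed failures, the failure
    rates add, so the cluster-wide MTBF is the per-GPU MTBF / n_gpus.
    """
    return per_gpu_mtbf_hours / n_gpus
```

At 50,000 hours per GPU this gives roughly 50 hours (~2 days) for 1,000 GPUs, ~3 hours for 16,384, and 30 minutes for 100,000, matching the table.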
Meta still achieved over 90% effective training time despite the failure rate. [2]Dubey et al., "The Llama 3 Herd of Models," Meta (2024)https://arxiv.org/abs/2407.21783 Only 3 of the 419 incidents required significant manual intervention. Automated detection, job restart, and checkpoint recovery handled the rest.
What causes GPU failures
HBM, the memory stacked directly on top of the GPU die, is a common source of failures that require GPU replacement. [2]Dubey et al., "The Llama 3 Herd of Models," Meta (2024)https://arxiv.org/abs/2407.21783 ECC corrects single-bit errors automatically, but uncorrectable double-bit errors mean the GPU needs to be swapped.
A broken NVLink stalls the entire training job because the collective blocks until every participant completes. [18]NVIDIA, "Multi-Node NVLink Troubleshooting Guide" (accessed March 2026)https://docs.nvidia.com/multi-node-nvlink-systems/nvdebug-guide/troubleshooting_guide.html Firmware mismatches between GPUs are a common trigger.
Without sufficient power and cooling, GPUs throttle: the clock speed drops and performance degrades without a clean error signal. [19]NVIDIA, "NVML API Reference: Clock Throttle Reasons" (accessed March 2026)https://docs.nvidia.com/deploy/nvml-api/group__nvmlClocksThrottleReasons.html
References
- Epoch AI, "Hardware Failures Won't Limit AI Scaling" (2024)
- Dubey et al., "The Llama 3 Herd of Models," Meta (2024)
- NVIDIA, "XID Errors" (accessed March 2026)
- NVIDIA, "Dynamic Page Retirement" (accessed March 2026)
- NVIDIA, "DCGM Diagnostics" (accessed March 2026)
- Ma et al., "Understanding Silent Data Corruption in LLM Training" (2025)
- NVIDIA, "GPU Debug Guidelines" (accessed March 2026)
- Based on industry experience and knowledge of NVIDIA datacenter GPU warranty programs
- NVIDIA, "RMA Process" (accessed March 2026)
- NVIDIA, "Enterprise Support Services" (accessed March 2026)
- Supermicro, "RMA" (accessed March 2026)
- Lenovo, "Install an SXM5 GPU, ThinkSystem SR675 V3" (accessed March 2026)
- PyTorch, "Reducing Checkpointing Times by Over 10x with Distributed Asynchronous Checkpointing" (2024)
- NVIDIA, "Dynamo Fault Tolerance" (accessed March 2026)
- NVIDIA, "DGX H100 Redfish API Support" (accessed March 2026)
- NVIDIA, "nvidia-smi Documentation" (accessed March 2026)
- NVIDIA, "NCCL Tests" (accessed March 2026)
- NVIDIA, "Multi-Node NVLink Troubleshooting Guide" (accessed March 2026)
- NVIDIA, "NVML API Reference: Clock Throttle Reasons" (accessed March 2026)
Frequently Asked Questions
How often do GPUs fail in a large cluster?
A single GPU has a mean time between failures of roughly 50,000 hours, about 6 years. At scale, failures are frequent. Meta trained Llama 3 405B on 16,384 H100 GPUs over 54 days and recorded 419 unexpected hardware interruptions, roughly one every three hours. GPU hardware faults were the most common category at 184 of the 419 incidents.
How long does it take to replace a failed GPU?
With on-site spares: detection in minutes, a technician swaps the GPU in 2-4 hours, and validation takes 1-2 hours. Back in production the same day. Without spares, NVIDIA standard RMA takes 7-14 business days. OEM advance replacement ships next business day with a qualifying support contract.
What is silent data corruption in GPU training?
Silent data corruption (SDC) is when the GPU computes wrong results without logging any error. Meta reported 6 SDC incidents during their 54-day Llama 3 training run on 16,384 H100s. Research has shown that SDC can cause training loss spikes and push models toward different optima, even when the per-step perturbations appear small.
What are the early warning signs of a GPU failure?
Three signs of a GPU degrading before it fails outright: correctable ECC error counts rising over time, thermal throttling events under normal load, and bandwidth speeds slowing down. NVIDIA's DCGM (Data Center GPU Manager) runs tiered diagnostics from Level 1 through Level 4 to detect these problems.