NVIDIA Software Ecosystem for AI

Illia Kasian

NVIDIA holds somewhere between 86% and 92% of the data center GPU market. [1] But analysts, competitors, and NVIDIA itself point to software as the company's main moat.

CUDA, NVIDIA's parallel computing platform, shipped in 2007. In the 19 years since, NVIDIA has built more than 400 optimized libraries on top of it, and roughly 6 million developers write CUDA code. [2][3] PyTorch and TensorFlow both use CUDA as their default GPU backend. AMD's ROCm is the closest alternative, but the ecosystem gap is wide: the nvidia/cuda Docker image has been pulled 105 million times, versus under 1 million for rocm/pytorch.

What the NVIDIA software stack is

NVIDIA AI Enterprise: licensing, support, security
Dynamo, NIM: distributed serving, deployment
PyTorch, JAX, TensorFlow: frameworks
TensorRT, TensorRT-LLM: inference
cuDNN, NCCL: deep learning
cuBLAS, cuFFT, cuSPARSE: math
CUDA: parallel compute

The stack runs in layers, and each layer depends on the one below it. CUDA is the foundation; every other library is built on top of it.

CUDA extends C++ with a handful of keywords that let developers write kernels: functions that run on the GPU instead of the CPU. You write .cu files, and nvcc, the NVIDIA CUDA compiler, splits them: kernel code compiles to PTX, an intermediate instruction set that the GPU driver translates to machine code for whichever GPU is installed, while CPU code goes to your regular C++ compiler. CUDA was first released in February 2007. [4]
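The split is easier to see in miniature. The sketch below uses plain Python to mimic the CUDA execution model: a kernel is written from the perspective of a single thread, and a launch runs one instance per thread. All names here are illustrative stand-ins, not CUDA API, and the loop stands in for hardware parallelism.

```python
# Sketch of the CUDA execution model in plain Python. In CUDA C++,
# the kernel body is written once and the hardware runs one instance
# per thread; here a loop stands in for that parallelism.

def scale_kernel(thread_idx, data, out, factor):
    """What a CUDA kernel expresses: the work of ONE thread."""
    if thread_idx < len(data):      # kernels guard against out-of-range threads
        out[thread_idx] = data[thread_idx] * factor

def launch(kernel, n_threads, *args):
    """Stand-in for CUDA's <<<grid, block>>> launch syntax."""
    for i in range(n_threads):      # on a GPU these run concurrently
        kernel(i, *args)

data = [1.0, 2.0, 3.0, 4.0]
out = [0.0] * 4
launch(scale_kernel, 8, data, out, 10.0)  # over-provisioned threads, as on a GPU
print(out)  # [10.0, 20.0, 30.0, 40.0]
```

The guard clause is why launching more threads than elements is safe, which is the normal pattern on real hardware, where thread counts are rounded up to block-size multiples.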

cuBLAS, cuFFT, and cuSPARSE sit one layer up; the “cu” prefix means CUDA. cuBLAS (CUDA Basic Linear Algebra Subroutines) handles linear algebra, specifically the matrix multiplications that make up most of a neural network's compute. cuFFT (CUDA Fast Fourier Transform) handles signal processing. cuSPARSE handles sparse matrices (matrices where most values are zero), which are common in recommendation models, graph neural networks, and pruned models. These are the math building blocks that higher-level libraries call into. [3]
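To make the cuBLAS claim concrete, its workhorse operation is general matrix multiply (GEMM). A naive pure-Python version shows what the tuned GPU kernels compute; this is an illustrative sketch only, not how cuBLAS is called in practice.

```python
def matmul(a, b):
    """Naive GEMM: the operation cuBLAS accelerates.
    a is (m x k), b is (k x n), result is (m x n)."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

# A linear layer in a neural network is exactly this: activations x weights.
activations = [[1.0, 2.0]]            # batch of 1, 2 input features
weights = [[0.5, 1.0, -1.0],
           [1.0, 0.0,  2.0]]          # 2 inputs -> 3 outputs
print(matmul(activations, weights))   # [[2.5, 1.0, 3.0]]
```

The triple loop is the whole algorithm; cuBLAS's thousands of lines per architecture exist to feed it through Tensor Cores and memory hierarchies efficiently.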

cuDNN (CUDA Deep Neural Network library) provides the core operations that neural networks are built from: convolutions, attention, normalization. Each operation has been manually optimized by NVIDIA engineers for each GPU generation. [5] When PyTorch trains a model, it calls cuDNN for these operations, so the speed of your training run is largely cuDNN's speed.
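For a sense of what "primitive" means here, a convolution is conceptually a sliding dot product. The math fits in a few lines; cuDNN's value is the per-architecture tuning of it. The sketch below is naive pure Python, not the cuDNN API.

```python
def conv1d(signal, kernel):
    """Naive 1-D convolution (valid mode): the kind of primitive cuDNN
    ships in dozens of hand-tuned GPU variants per architecture."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

# An edge-detector kernel sliding over a ramp signal:
print(conv1d([1, 2, 3, 4, 5], [1, 0, -1]))  # [-2, -2, -2]
```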

NCCL (NVIDIA Collective Communications Library, pronounced “nickel”) handles multi-GPU communication over NVLink. [6] Training a model across 8 GPUs, or 8,000, requires coordinating gradient updates. NCCL handles that coordination: collecting results from every GPU, combining them, and distributing the combined result back so every GPU stays in sync.
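The core collective here is all-reduce: combine a value from every GPU, then hand the combined result back to all of them. A minimal sketch follows, with plain Python lists standing in for gradient tensors; real NCCL uses ring and tree algorithms over NVLink rather than a central sum.

```python
def all_reduce(per_gpu_grads):
    """Sketch of NCCL's all-reduce: sum each gradient across GPUs,
    then give every GPU an identical copy of the combined result."""
    n_params = len(per_gpu_grads[0])
    combined = [sum(gpu[i] for gpu in per_gpu_grads) for i in range(n_params)]
    return [list(combined) for _ in per_gpu_grads]  # one copy per GPU

grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 GPUs, 2 parameters each
synced = all_reduce(grads)
print(synced[0])  # [9.0, 12.0] -- identical on every GPU
```

After this step every GPU applies the same update, which is what keeps data-parallel replicas from drifting apart.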

TensorRT compiles trained models into optimized inference engines. It fuses layers (combining multiple operations into one to reduce memory reads), selects the fastest kernels for each GPU generation, and applies quantization (reducing numerical precision to use less memory and run faster). [7] TensorRT-LLM extends this to large language models: in-flight batching (processing multiple user requests simultaneously, even when they are at different stages), KV-cache management (reusing previously computed attention values so the model does not recalculate them for each new token), and tensor parallelism (splitting a single model across GPUs when it is too large to fit on one).
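The KV-cache idea can be shown with a toy model: cache each token's computed values once and reuse them at every later decode step. In the sketch below, `encode` is a made-up stand-in for computing a token's key/value tensors, and a counter tracks how much recomputation the cache avoids.

```python
# Toy sketch of KV-cache reuse during autoregressive decoding.
calls = {"count": 0}

def encode(token):
    """Stand-in for computing a token's key/value tensors (expensive)."""
    calls["count"] += 1
    return token * 2  # toy computation

def generate(prompt, n_new, cache):
    seq = list(prompt)
    for _ in range(n_new):
        for t in seq:                    # attention looks at every prior token
            if t not in cache:
                cache[t] = encode(t)     # compute once, reuse thereafter
        seq.append(sum(cache[t] for t in seq) % 7)  # toy "next token" rule
    return seq

cache = {}
generate([1, 2, 3], 4, cache)
print(calls["count"])  # 4 encode calls; recomputing every step would take 18
```

Without the cache, each decode step would re-encode the whole growing sequence, which is the quadratic cost that KV-cache management exists to avoid.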

Dynamo is NVIDIA's distributed inference framework, open-sourced at GTC in March 2025 and in production as of 2026. [8][9] It handles request routing, batching, and GPU scheduling across a fleet of servers, and supports multiple LLM inference backends, including vLLM, TensorRT-LLM, and SGLang.

NIM (NVIDIA Inference Microservices) packages optimized models into Docker containers with standardized API endpoints. Pull a container, deploy, serve. Included with NVIDIA AI Enterprise.

NVIDIA AI Enterprise is the enterprise wrapper: support, security patches, and access to NIM, licensed per GPU per year. [10]

| Layer | Key software | What it does | When it matters |
|---|---|---|---|
| Core runtime | CUDA | Parallel compute API for NVIDIA GPUs | Every GPU workload |
| Math | cuBLAS, cuFFT, cuSPARSE | Linear algebra, signal processing, sparse ops | Training, HPC, scientific computing |
| Deep learning | cuDNN, NCCL | Neural net primitives, multi-GPU communication | Training and fine-tuning |
| Inference | TensorRT, TensorRT-LLM | Model compilation, quantization, LLM optimization | Deploying models to production |
| Frameworks | PyTorch, JAX, TensorFlow | Training, experimentation, model development | All ML development |
| Deployment | Dynamo, NIM | Distributed serving, containerized inference | Production inference at scale |
| Enterprise | NVIDIA AI Enterprise | Licensing, support, security | Enterprise deployments |

How CUDA became the default

Ian Buck, a Stanford PhD student, built a system called Brook in 2003. Brook let programmers run general-purpose code on graphics cards, which at the time were only used for rendering video games. NVIDIA hired Buck in 2004 and paired him with John Nickolls, the company's director of architecture for GPU computing.

Together they turned Brook into CUDA. The SDK shipped February 15, 2007. [4]NVIDIA CUDA Platform for Accelerated Computinghttps://developer.nvidia.com/cuda

It was not an obvious bet. NVIDIA reportedly spent over a billion dollars building CUDA, with R&D running as high as 25-30% of revenue during the 2008-2010 period. It took nearly a decade before the investment showed clear returns, and industry watchers openly questioned whether it would pay off at all.

For five years, CUDA was a niche tool. Researchers in physics, finance, and molecular simulation used it for parallel computation. The deep learning boom had not started.

NVIDIA kept one GPU architecture from gaming laptops to data center servers. A student learns CUDA on a $300 GeForce and runs the same code on an H100. In contrast, AMD split into RDNA for consumers and CDNA for data center, two architectures, two software stacks.

Then AlexNet happened. In 2012, Alex Krizhevsky trained a convolutional neural network on two GTX 580 GPUs using CUDA and won the ImageNet competition by a wide margin. [11] That result convinced the machine learning community that GPU training worked. Within three years, every major ML framework had added CUDA as its default backend.

AMD launched ROCm (Radeon Open Compute platform) in 2016, nine years after CUDA. [12] Intel launched oneAPI in 2020, thirteen years after. Google's TorchTPU was reported in December 2025, eighteen years after. [13] Each entered an ecosystem where the libraries, documentation, Stack Overflow answers, university courses, and muscle memory all pointed to CUDA.

The result is a developer flywheel: more libraries attract more developers, which attracts more framework support, which makes NVIDIA invest in more libraries.

NVIDIA optimized cuDNN and TensorRT for PyTorch and TensorFlow directly, so the framework teams at Google and Meta never had to handle low-level GPU work themselves. The NVIDIA Deep Learning Institute has trained hundreds of thousands of developers, cementing CUDA as the skill universities teach and employers hire for. R&D spending hit $12.9 billion in FY2025, up 49% year over year. [14]

How the layers chain together

The training path

Your code (typically PyTorch) → cuDNN (runs each operation on the GPU) → cuBLAS (handles the matrix math underneath) → NCCL (keeps multiple GPUs in sync after each step). A single training run exercises every one of these layers.

That is the textbook path. At the frontier, labs diverge. xAI trained Grok 3 on 200,000 H100/H200 GPUs using JAX, Google's ML framework and the main alternative to PyTorch. DeepSeek trained V3 on 2,048 H800s using custom low-level kernels for GPU communication and matrix multiplication, bypassing standard NVIDIA libraries for specific operations where they needed deeper hardware control. [15] Meta uses NCCLX, a custom fork of NCCL, for Llama 4 training. The standard stack is widely used for fine-tuning and smaller-scale training; at the frontier, the lock-in sits at the hardware and CUDA runtime level.

The inference path

Trained model → TensorRT or TensorRT-LLM (compilation and quantization) → Dynamo (request routing and scheduling across servers) → NIM (containerized deployment with API endpoints).
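The in-flight batching that TensorRT-LLM and Dynamo-style schedulers rely on can be sketched as a loop in which finished requests exit and waiting requests join between decode steps. This is a toy scheduler; the names, request format, and batch limit are all invented for illustration.

```python
# Toy sketch of in-flight (continuous) batching: requests at different
# stages share each decode step, and finished requests leave the batch
# while new ones join mid-flight instead of waiting for a full drain.

def serve(requests, max_batch=4):
    """Each request is (id, tokens_remaining). Returns (completion order, steps)."""
    waiting = list(requests)
    active, done, step = [], [], 0
    while waiting or active:
        while waiting and len(active) < max_batch:   # admit new requests mid-flight
            active.append(list(waiting.pop(0)))
        step += 1
        for req in active:
            req[1] -= 1                              # one decode step for the batch
        done += [req[0] for req in active if req[1] == 0]
        active = [req for req in active if req[1] > 0]
    return done, step

order, steps = serve([("a", 2), ("b", 5), ("c", 1), ("d", 3), ("e", 2)])
print(order, steps)  # ['c', 'a', 'd', 'e', 'b'] 5
```

Short requests finish and free their slot immediately, which is why continuous batching keeps GPU utilization high under mixed-length traffic.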

In production, most inference platforms use open-source engines like vLLM and SGLang rather than TensorRT-LLM. Fireworks AI built FireAttention, their own CUDA kernel library, and also partnered with AMD to support its MI325X and MI355X GPUs. DeepSeek runs fully custom inference with DeepGEMM (their own matrix multiply library) and FlashMLA (a custom attention kernel). As with training, NVIDIA's prescribed stack is the starting point, and most production deployments customize or replace parts of it.

How the software connects to the hardware

NVIDIA's libraries are not generic. They ship hand-optimized code for each GPU architecture, so the same operation runs differently on an A100 than on an H100 or a Blackwell GPU. This tight coupling between software and hardware is a core part of why NVIDIA GPUs outperform their specs on paper.

cuDNN, for example, auto-selects kernels per GPU generation. A convolution on an H100 uses different instructions than on an A100. Developers write one function call; cuDNN figures out the rest. TensorRT does something similar for inference, activating hardware features that general frameworks leave on the table: Tensor Cores, FP8 precision on Hopper and Blackwell, sparsity support on Ampere and later.
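The dispatch pattern itself is simple; the value is in the kernels behind it. Below is a sketch of architecture-based kernel selection. The table, names, and return values are invented for illustration and are not real cuDNN internals.

```python
# Sketch of how a library can dispatch per GPU generation: one public
# function, architecture-specific implementations underneath.

KERNELS = {
    "ampere": lambda a, b: ("fp16_tensor_core_conv", a + b),
    "hopper": lambda a, b: ("fp8_tensor_core_conv",  a + b),
}

def convolve(a, b, arch):
    """One call site; the library picks the fastest kernel it knows
    for the detected architecture, falling back to a generic path."""
    impl = KERNELS.get(arch, lambda a, b: ("generic_conv", a + b))
    return impl(a, b)

print(convolve(1, 2, "hopper"))   # ('fp8_tensor_core_conv', 3)
print(convolve(1, 2, "unknown"))  # ('generic_conv', 3)
```

The caller's code never changes between generations; only the table grows, which is how a decade of tuning accumulates behind a stable API.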

This coupling goes deep. FlashAttention is the algorithm that made transformer attention fast enough to train today's large language models. FlashAttention-3 was built exclusively for Hopper GPUs. Running it on an A100 means falling back to FlashAttention-2. Running it on AMD requires a separate port. The pattern repeats: cutting-edge optimizations target specific NVIDIA architectures first.

Each new GPU generation demands software rewrites, which deepens the dependency. Teams that optimize for one NVIDIA generation carry that knowledge forward to the next, but not across to competitors. CUDA 13.1 (December 2025) introduced CUDA Tile, a higher-level programming model in which NVIDIA handles architecture-specific optimization under the hood. [16] The more abstraction NVIDIA adds, the more room it has to optimize, and the harder it gets for competitors to replicate.

Why the stack is a moat

NVIDIA controls what happens between your code and the GPU, and each layer of that stack makes the next one harder to replace.

First, developers build where the tooling already works. CUDA has the tutorials, examples, university courses, framework integrations, and debugging history. Once a team has trained models, tuned kernels, and built internal tooling around CUDA, switching means retraining engineers and rewriting infrastructure, not just swapping chips.

Second, library depth compounds over time. cuDNN, NCCL, TensorRT, and the broader CUDA-X stack encode years of performance work for common workloads. Competitors can match parts of the stack, but they have to reproduce a long tail of kernels, edge cases, documentation, and framework compatibility before the alternative feels equally safe.

Third, NVIDIA ships hardware and software together. New GPU generations are not just faster chips. They arrive with updated compilers, libraries, and kernels that expose the new hardware quickly. That tight coupling means NVIDIA can turn architecture changes into usable performance sooner than rivals that only compete on silicon.

Fourth, the stack reaches all the way into production. The moat is not only training code. It extends into inference optimization, cluster-level communication, deployment tooling, and enterprise support. That gives buyers one vendor for development, scaling, and operations, which lowers adoption risk even when competitors have comparable hardware on paper.

Put together, this is why NVIDIA's moat is cumulative. A rival does not need to beat one library or one GPU. It has to offer a credible replacement for the workflows, performance tuning, and operational confidence that sit on top of the hardware.

Why switching is hard

How hard it is depends on where you sit in the stack. PyTorch code that never touches a GPU kernel can switch to ROCm or XLA (Google's compiler for TPUs and GPUs) backends. Custom CUDA kernels or libraries built around NVIDIA hardware mean months of rewriting.

The dependency chain is the core problem: your code → PyTorch → cuDNN → CUDA. You need every library in the chain to have a working equivalent on the target platform.

AMD's HIP tool translates simple CUDA code automatically, with under 5% of lines needing manual changes. [12] Custom kernels, inline PTX assembly, and CUDA-specific memory patterns require manual rewriting. The documentation gap compounds it: search for a CUDA error and you get dozens of results; search for a ROCm error and you might get two, one of them outdated.

Even with automatic translation covering most CUDA code, the long tail of edge cases, specialized libraries, and tooling gaps remains significant as of early 2026. AMD once funded ZLUDA, a project to run unmodified CUDA code on AMD GPUs, then killed it.
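Much of what the HIPIFY tools automate is one-for-one renaming of API identifiers, which is why straightforward code ports cleanly while inline PTX and CUDA-specific memory tricks do not. A toy sketch of the renaming step follows; the three-entry table is illustrative, while the real tools cover thousands of identifiers.

```python
# Toy sketch of the mechanical part of CUDA-to-HIP translation:
# renaming API calls one-for-one. Everything the table misses is
# exactly the "manual rewriting" long tail.

RENAMES = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree":   "hipFree",
}

def hipify(source):
    for cuda_name, hip_name in RENAMES.items():
        source = source.replace(cuda_name, hip_name)
    return source

src = "cudaMalloc(&d_a, n); cudaMemcpy(d_a, a, n, cudaMemcpyHostToDevice);"
print(hipify(src))
# hipMalloc(&d_a, n); hipMemcpy(d_a, a, n, hipMemcpyHostToDevice);
```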

Regulators have noticed. The U.S. DOJ issued subpoenas to NVIDIA in 2024-2025 investigating GPU allocation practices. [17] France's competition authority flagged CUDA dependency as a structural barrier to competition. [18]

Alternatives to the NVIDIA stack

AMD ROCm is the closest competitor. The MI300X and MI355X are competitive on raw specs, but CUDA often delivers better real-world performance on comparable hardware thanks to more mature kernel tuning, though the gap varies by workload. SemiAnalysis projects AMD's AI GPU market share reaching 10% by mid-2026. [19]

Google TPU with TorchTPU is the most significant challenge to CUDA's grip on PyTorch users. Google partnered with Meta to develop TorchTPU, reported in December 2025, to let PyTorch models run natively on TPUs without code rewrites. [13] TPU v7 Ironwood delivers 4,614 TFLOPS peak FP8 per chip. [20] TPUs are cloud-only: you rent them through Google Cloud and cannot buy or resell them. Google trained Gemini 3 entirely on TPUs with zero CUDA involvement.

Intel oneAPI targets heterogeneous computing across CPUs, GPUs, and FPGAs (field-programmable gate arrays, chips that can be reconfigured for different tasks). Intel's Gaudi AI accelerators are being discontinued in favor of Jaguar Shores in 2026; its AI hardware strategy is in transition.

AWS Trainium is Amazon's custom AI accelerator. AWS activated Project Rainier, a cluster of roughly 500,000 Trainium2 chips for Anthropic, making it the largest non-NVIDIA AI training cluster in production. [21] It uses Amazon's Neuron SDK and is available only through AWS.

Groq built a specialized inference chip, the LPU. In December 2025, NVIDIA paid approximately $20 billion to license Groq's technology and hire its leadership, structured as a licensing deal rather than a full acquisition to avoid triggering antitrust review. [22]

Chinese alternatives are advancing under U.S. export controls. Huawei's Ascend 910B with MindSpore is the most complete non-NVIDIA stack in production. Zhipu AI trained GLM-5 on 100,000 Ascend chips. [23] Baidu built a 30,000-chip Kunlun cluster for AI model training. [24] These ecosystems prove NVIDIA's software stack can be replicated when export controls remove the option of using it.

OpenAI Triton takes a different approach. It is a Python-based language for writing GPU kernels that compile to both NVIDIA and AMD GPUs. Instead of relying as heavily on vendor-specific libraries, developers can write Triton kernels that the compiler optimizes per hardware target. If enough critical kernels move into Triton, the hardware layer becomes more interchangeable.

| Platform | Ecosystem maturity | Migration from CUDA | Hardware access |
|---|---|---|---|
| NVIDIA CUDA | 19 years, 400+ libraries, 6M developers | N/A (incumbent) | Buy or rent |
| AMD ROCm | 10 years, growing | HIP auto-translation, <5% manual | Buy or rent |
| Google TPU | 8+ years, different ecosystem | TorchTPU (Dec 2025) | Rent only (Google Cloud) |
| AWS Trainium | 3+ years | Neuron SDK (different paradigm) | Rent only (AWS) |
| Huawei Ascend | 5+ years, China-only | MindSpore (different paradigm) | Buy (China only) |

References

  1. Silicon Analysts, "NVIDIA GPU Market Share 2024–2026: 87% Peak, What Comes Next" (February 2026); Jon Peddie Research, GPU Market Share Q3 2025. https://siliconanalysts.com/analysis/nvidia-ai-accelerator-market-share-2024-2026
  2. NVIDIA CEO Jensen Huang, keynote address (December 2024); NVIDIA Developer Forums, "CUDA is NVIDIA's platform for accelerated computing" (March 2026). https://forums.developer.nvidia.com/t/cuda-is-nvidia-s-platform-for-accelerated-computing-and-the-foundation-for-gpu-computing/363954
  3. NVIDIA CUDA-X GPU-Accelerated Libraries. https://developer.nvidia.com/cuda/cuda-x-libraries
  4. NVIDIA CUDA Platform for Accelerated Computing. https://developer.nvidia.com/cuda
  5. NVIDIA cuDNN Documentation. https://developer.nvidia.com/cudnn
  6. NVIDIA NCCL Documentation. https://developer.nvidia.com/nccl
  7. NVIDIA TensorRT SDK. https://developer.nvidia.com/tensorrt
  8. NVIDIA Newsroom, "NVIDIA Dynamo Open-Source Library Accelerates and Scales AI Reasoning Models" (March 2025). https://nvidianews.nvidia.com/news/nvidia-dynamo-open-source-library-accelerates-and-scales-ai-reasoning-models
  9. NVIDIA Newsroom, "NVIDIA Enters Production With Dynamo" (2026). https://nvidianews.nvidia.com/news/nvidia-enters-production-with-dynamo-the-broadly-adopted-inference-operating-system-for-ai-factories
  10. NVIDIA AI Enterprise Licensing and Pricing Guide. https://docs.nvidia.com/ai-enterprise/planning-resource/licensing-guide/latest/pricing.html
  11. Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton, "ImageNet Classification with Deep Convolutional Neural Networks" (2012). https://papers.nips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html
  12. AMD ROCm Documentation. https://rocm.docs.amd.com/
  13. Reuters, "Google teams with Meta's PyTorch to chip away at Nvidia's moat" (December 18, 2025). https://finance.yahoo.com/news/google-teams-meta-pytorch-chip-161749510.html
  14. NVIDIA Newsroom, "NVIDIA Announces Financial Results for Fourth Quarter and Fiscal 2025". https://nvidianews.nvidia.com/news/nvidia-announces-financial-results-for-fourth-quarter-and-fiscal-2025
  15. DeepSeek-AI, "DeepSeek-V3 Technical Report" (arXiv:2412.19437, December 2024). https://arxiv.org/html/2412.19437v1
  16. NVIDIA, "NVIDIA CUDA 13.1 Powers Next-Gen GPU Programming with NVIDIA CUDA Tile" (December 4, 2025). https://developer.nvidia.com/blog/nvidia-cuda-13-1-powers-next-gen-gpu-programming-with-nvidia-cuda-tile-and-performance-gains/
  17. U.S. Department of Justice, antitrust investigation into NVIDIA GPU allocation practices; Bloomberg, "Nvidia Gets DOJ Subpoena in Escalating Antitrust Probe" (September 2024). https://finance.yahoo.com/news/nvidia-gets-doj-subpoena-escalating-210038371.html
  18. Autorité de la concurrence, Investigation into NVIDIA Market Practices (2024). https://www.autoritedelaconcurrence.fr/
  19. SemiAnalysis, AMD AI GPU Market Share Projections (2025); Silicon Analysts, NVIDIA AI Accelerator Market Share 2024–2026 (February 2026). https://siliconanalysts.com/analysis/nvidia-ai-accelerator-market-share-2024-2026
  20. Google Cloud, TPU v7 (Ironwood) Documentation. https://docs.cloud.google.com/tpu/docs/tpu7x
  21. DCD, "AWS activates Project Rainier cluster of nearly 500,000 Trainium2 chips" (2025). https://www.datacenterdynamics.com/en/news/aws-activates-project-rainier-cluster-of-nearly-500000-trainium2-chips/
  22. CNBC, "Nvidia buying AI chip startup Groq's assets for about $20 billion" (December 24, 2025). https://www.cnbc.com/2025/12/24/nvidia-buying-ai-chip-startup-groq-for-about-20-billion-biggest-deal.html
  23. South China Morning Post, "Zhipu AI breaks US chip reliance with first major model trained on Huawei stack" (2025). https://www.scmp.com/tech/tech-war/article/3339869/zhipu-ai-breaks-us-chip-reliance-first-major-model-trained-huawei-stack
  24. Technology.org, "Baidu Lights Up 30,000-Strong Kunlun Chip Cluster" (April 2025). https://www.technology.org/2025/04/25/baidu-lights-up-30000-strong-kunlun-chip-cluster-aims-to-train-deepseek-like-ai-models/

Frequently Asked Questions

What is NVIDIA CUDA?

CUDA is NVIDIA’s parallel computing platform and programming model for NVIDIA GPUs. Released in February 2007, it lets developers write GPU-accelerated code and gives higher-level libraries like cuDNN, NCCL, and TensorRT a common foundation. In AI, CUDA is the layer most frameworks and GPU tooling are built around.

Can you switch from NVIDIA CUDA to AMD ROCm?

Yes, but the difficulty depends on how much NVIDIA-specific code you use. Simple workloads can often move with HIPIFY and backend changes, while custom CUDA kernels, inline PTX, and CUDA-tuned infrastructure usually require meaningful rewrites, retesting, and performance work. The migration problem is not just syntax, but replacing the surrounding libraries, tooling, and operational knowledge.

Why is NVIDIA’s software stack a moat?

Because the moat sits above the chip. CUDA is the default path many teams use to get from model code to fast GPU execution, and NVIDIA has spent years layering optimized libraries, framework integrations, deployment tools, and enterprise support on top of it. A competitor has to match not just GPU specs, but the performance tuning, developer workflows, and production reliability built around the stack.
