Disaggregated Inference: How NVIDIA, AWS, and Cerebras Are Rethinking LLM Inference

Illia Kasian

Inference, the process of generating an output from an LLM, can be broken down into two phases: prefill and decode. Prefill is compute-bound, meaning it is limited by how fast the chip can do math. Decode is memory-bound, meaning it is limited by how fast we can read the saved results from memory. When both phases share the same hardware, they interfere with each other.

Disaggregated inference solves this by splitting the work across dedicated hardware. In its simplest form, some GPUs handle prefill while others handle decode. AWS and Cerebras go further by using different chip types for each phase, but both chips still run the full model. NVIDIA and Groq split the model itself, running different layers on GPU and SRAM-based chips.

Prefill and decode

LLMs produce outputs by predicting one token (a word fragment, or other piece of data) at a time and then using that newly generated token, along with all previous tokens, to predict the next one. Done naively, this means a lot of repeated calculations, since we pass all previous tokens every time.

To avoid this, the model saves intermediate results in memory in what is called the KV cache (key-value cache). The cache grows with every token: for a large model processing a long prompt, it can reach tens of gigabytes. The initial pass that processes the full prompt and builds the KV cache is called prefill. Prefill runs many tokens through the model in parallel, so the bottleneck is raw math throughput: it is compute-bound.
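The "tens of gigabytes" claim is easy to check with arithmetic. A minimal estimator, using a Llama-3-70B-like shape (80 layers, 8 grouped-query KV heads, head dimension 128, FP16 cache entries) — the shapes are illustrative, not tied to any specific system in this article:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to store the keys and values for one request."""
    # Two tensors (K and V) per layer, one vector per token per KV head.
    return 2 * num_layers * seq_len * num_kv_heads * head_dim * bytes_per_elem

# 80 layers, 8 KV heads (GQA), head_dim 128, FP16, 32k-token prompt
size = kv_cache_bytes(80, 8, 128, seq_len=32_000)
print(f"{size / 1e9:.1f} GB")  # 10.5 GB -- and it grows linearly with prompt length
```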

Every subsequent pass that generates one new token by reading the KV cache and model weights is called decode. Decode only processes one token at a time, so the chip spends most of its time waiting for data to load from memory rather than doing math: it is memory-bound.
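That bottleneck can be stated as a back-of-envelope roofline: a single decode step cannot finish faster than the time needed to stream the weights and KV cache from memory. A sketch with illustrative sizes (a ~70 GB FP8 model and a 10 GB cache on a B200-class GPU):

```python
def max_decode_tokens_per_s(weight_gb, kv_gb, bandwidth_tb_s):
    """Upper bound on single-request decode speed: each new token must
    stream the full model weights plus the KV cache from memory."""
    bytes_per_token = (weight_gb + kv_gb) * 1e9
    return bandwidth_tb_s * 1e12 / bytes_per_token

print(f"{max_decode_tokens_per_s(70, 10, 8):.0f} tok/s")  # 100 tok/s ceiling
```

Batching raises aggregate throughput, because the weight read is shared across requests, but this per-request ceiling is why decode hardware chases memory bandwidth.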

KV cache removes redundant processing
[Animated diagram: the same prompt processed with and without the KV cache. Without it, every step re-runs all previous tokens; with it, each step processes only the one new token.]

Software disaggregation

When GPUs serve multiple users at the same time, they batch their requests together. The model weights are loaded from memory once and shared across all requests in the batch, so more requests per batch means less time spent re-reading the same weights. Each request still gets its own KV cache, but they share the expensive weight reads. Since different users have different prompt lengths and ask questions at different times, a batch often contains a mix of requests in the prefill stage and the decode stage.

When that happens, the entire batch has to wait for the slowest operation. Prefill is the slow one (processing many tokens at once), so decode requests that could finish quickly end up waiting.
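The effect is easy to model: one batch iteration takes as long as its slowest member. With illustrative step times (not measured from any system cited here):

```python
DECODE_STEP_MS = 10    # one token for an in-flight decode request
PREFILL_STEP_MS = 200  # a long prompt processed in the same iteration

def batch_step_ms(has_prefill):
    # The whole batch advances in lockstep, so every decode request
    # waits for the prefill to finish its chunk.
    return max(DECODE_STEP_MS, PREFILL_STEP_MS) if has_prefill else DECODE_STEP_MS

inflation = batch_step_ms(True) / batch_step_ms(False)
print(f"decode latency inflation: {inflation:.0f}x")  # 20x with these numbers
```

These particular numbers are made up, but a 20x inflation sits comfortably inside the 2-30x range the follow-up analysis reported.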

The DistServe paper (2024) quantified this interference, and follow-up analysis found that a single large prefill can inflate decode latency by 2-30x. The fix is to separate them into distinct pools (groups of GPUs), where one pool is dedicated to prefill and another to decode. Each pool batches only its own type of request, so they never interfere with each other. [1][3]

By 2025, nearly every major serving framework supported prefill-decode disaggregation: vLLM, SGLang, NVIDIA Dynamo, llm-d, TensorRT-LLM, and MoonCake. The approach became the default for production LLM serving. [3]

NVIDIA Dynamo, the open-source inference framework released at GTC 2025, added KV-cache-aware routing and a planner that dynamically shifts GPUs between prefill and decode pools based on latency targets: how fast the first token should arrive (time to first token) and how fast subsequent tokens should stream (time per output token). If decode latency rises, the planner moves GPUs from the prefill pool to decode, trading throughput for responsiveness.
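The planner's control loop can be sketched as a toy feedback rule. The real Dynamo planner's logic and interfaces differ, so treat every name below as hypothetical:

```python
def rebalance(pools, ttft_ms, tpot_ms, ttft_target=500.0, tpot_target=50.0):
    """Move one GPU toward whichever pool is missing its latency target."""
    if tpot_ms > tpot_target and pools["prefill"] > 1:
        pools["prefill"] -= 1   # decode is lagging: shrink prefill,
        pools["decode"] += 1    # grow decode (throughput for responsiveness)
    elif ttft_ms > ttft_target and pools["decode"] > 1:
        pools["decode"] -= 1    # first token is slow: do the opposite
        pools["prefill"] += 1
    return pools

print(rebalance({"prefill": 4, "decode": 4}, ttft_ms=300, tpot_ms=80))
# {'prefill': 3, 'decode': 5}
```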

On Blackwell hardware, Dynamo achieved up to 30x more requests served compared to non-disaggregated serving of DeepSeek-R1. [2] Separately, LMSYS benchmarked GB200 NVL72 with disaggregated serving using SGLang and measured 3.8x prefill throughput and 4.8x decode throughput versus H100. [18]

Separate pools also allow independent scaling. If your users send long prompts but expect short answers, you add more prefill capacity. If they send short prompts but generate long outputs, you add more decode capacity. A single shared pool can't make that trade-off.

Hardware disaggregation

Software disaggregation uses identical GPUs in both pools. Today that typically means NVIDIA B200s: each GPU has 192 GB of HBM3e delivering 8 TB/s of memory bandwidth. The same chip handles both prefill and decode. [15]

[Diagram: a prefill pool and a decode pool, each made of B200 GPUs (HBM3e, 192 GB, 8 TB/s), linked by a KV cache transfer.]

The next step is using different hardware for each phase.

Prefill-decode: specialized chips for each phase

In prefill-decode hardware disaggregation, both pools run all layers of the model. The prefill pool can use chips with cheaper, slower memory (since prefill is compute-bound and doesn't need extreme bandwidth). The decode pool needs chips with fast memory (since decode reads the full model weights and KV cache for every token).

NVIDIA's GDDR7 experiment: Rubin CPX

NVIDIA announced the Rubin CPX in September 2025: a data center GPU with 128 GB of GDDR7 (a type of graphics memory normally found in consumer GPUs) instead of the expensive HBM (high-bandwidth memory). The idea was straightforward: since prefill doesn't need extreme memory bandwidth, use cheaper memory (GDDR7 costs over 50% less per GB than HBM) and put the savings into compute density. GDDR7 also draws less power than HBM, bringing the CPX to around 800W, under half of the HBM-equipped Rubin R200, which means more chips per rack. [14][16]

[Diagram: a Rubin CPX (GDDR7, 128 GB) handles prefill; the KV cache transfers to a Rubin R200 (HBM4, 288 GB) for decode.]

NVIDIA pulled the CPX from its roadmap six months later at GTC 2026. The savings were real, but regular GPUs handle prefill fine, just less cost-efficiently. NVIDIA instead partnered with Groq on a different approach to disaggregation (explained later in this article). A similar specialized inference accelerator may return in NVIDIA's next Feynman generation around 2028, though possibly with different memory technology. [6]

AWS + Cerebras: Trainium for prefill, wafer-scale for decode

[Diagram: an AWS Trainium chip (HBM) handles prefill; the KV cache transfers to a Cerebras CS-3 (SRAM, 44 GB, 21 PB/s) for decode.]

In March 2026, AWS and Cerebras announced a disaggregated inference partnership on Amazon Bedrock. Trainium chips handle prefill. Trainium is a general-purpose accelerator with HBM, not specifically designed for prefill, but it works well for compute-bound workloads. The KV cache then transfers to Cerebras CS-3 systems for decode. [7]

The Cerebras CS-3 runs the Wafer-Scale Engine 3 (WSE-3): a single chip the size of an entire silicon wafer (the dinner-plate-sized disc that chips are normally cut from).

In normal chip manufacturing, a silicon wafer is printed with hundreds of identical chips, then cut apart, tested, and packaged individually. Cerebras skips the cutting step and uses the entire wafer as one processor. Defects are inevitable at this scale, so the chip is designed with redundant cores that route around bad spots.

The result is 44 GB of on-chip SRAM (static RAM, the fast memory normally used for small caches inside processors) delivering 21 PB/s of memory bandwidth, roughly 2,600x more than a single NVIDIA B200. That extreme bandwidth makes it optimal for decoding. Cerebras reports decode speeds reaching 1,200 tokens per second and up to 4.5x improvement in P95 latency (the response time that 95% of requests fall under) on agentic workloads. [8][11]
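The bandwidth ratio follows directly from the quoted specs:

```python
b200_bw = 8e12   # B200 HBM3e: 8 TB/s
wse3_bw = 21e15  # WSE-3 on-chip SRAM: 21 PB/s
print(f"{wse3_bw / b200_bw:,.0f}x")  # 2,625x -- the "roughly 2,600x" above
```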

SambaNova: GPUs for prefill, RDUs for decode

[Diagram: a GPU (HBM) handles prefill; the KV cache transfers to a SambaNova RDU (SRAM + HBM + DDR) for decode.]

SambaNova, an AI chip company, follows the same pattern. GPUs handle prefill, then the KV cache transfers to SN50 RDUs (Reconfigurable Dataflow Units) for decode.

The RDU combines three memory tiers, each matched to a different job: a small amount of fast SRAM for combining intermediate results between operations, HBM for storing the model weights that get read every token, and a large pool of DDR (the standard server memory) for caching prompts across requests. A GPU uses HBM for everything. The RDU's layered approach puts each type of data on the cheapest memory that's fast enough for it. [17]

NVIDIA + Groq: splitting the model itself

NVIDIA's partnership with Groq takes a fundamentally different approach. To understand it, it helps to know how an LLM is structured.

[Diagram: an LLM as a stack of layers 1 through N, each containing an attention block followed by a feed-forward block.]

An LLM is built from many layers, and each layer consists of two main blocks: an attention block (which decides what parts of the input to focus on) and a feed-forward block (which transforms each token's representation). In the previous examples, both prefill and decode run the full model, all layers, all blocks. The NVIDIA + Groq approach splits the two blocks within each layer onto different chips.

During decode, for every layer in the model: the Rubin GPU runs the attention block, then passes the intermediate results to a Groq LPU (language processing unit) which runs the feed-forward block, then passes the result back to the GPU for the next layer's attention. This loop repeats for every layer. NVIDIA calls it Attention-FFN Disaggregation (AFD). [4][5]
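The per-layer ping-pong can be sketched as a loop over fake device handles. The handles and methods below are stand-ins for illustration, not any vendor's API:

```python
class FakeDevice:
    """Stand-in for a device runtime; real systems dispatch kernels."""
    def attention(self, h):    return h + 1  # placeholder compute
    def feed_forward(self, h): return h * 2  # placeholder compute

def afd_decode_step(hidden, num_layers, gpu, lpu):
    # Per layer: attention on the GPU (it holds the KV cache in HBM),
    # then the activations hop to the LPU for the feed-forward block
    # (weights resident in SRAM), then back for the next layer.
    for _ in range(num_layers):
        hidden = gpu.attention(hidden)     # GPU side
        hidden = lpu.feed_forward(hidden)  # transfer to LPU and back
    return hidden

print(afd_decode_step(0, num_layers=3, gpu=FakeDevice(), lpu=FakeDevice()))  # 14
```

The structure is the point: two cross-chip transfers per layer per token, which is why the interconnect numbers below matter.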

[Diagram: during decode, each layer bounces between the Rubin GPU (attention + KV cache) and the Groq LPU (feed-forward weights), passing intermediate results back and forth. Each chip holds different parts of the model.]

The split makes sense because attention needs access to the KV cache, which is large and stored in HBM on the GPU. The feed-forward layers don't need the KV cache; they just need to read their own weights quickly, which makes on-chip SRAM sufficient. Groq's LP30 chip has 500 MB of on-chip SRAM delivering 150 TB/s of bandwidth per chip, almost 20x the bandwidth of NVIDIA's B200 with its 192 GB of HBM3e at 8 TB/s. [5][10]

The trade-offs are capacity and transport. Only 500 MB of SRAM per chip means you'll need many chips for large models, and intermediate results must bounce between the GPU and LPU at every layer, adding network overhead. An LPX rack (NVIDIA's server rack built around Groq LPUs) packs 256 LP30 chips for 128 GB of SRAM total at 40 PB/s aggregate bandwidth, with 640 TB/s of chip-to-chip interconnect to keep that transfer fast. [5]

At GTC 2026, NVIDIA announced that Groq LPX racks will integrate with the Vera Rubin platform. Rubin GPUs handle prefill and attention during decode. Between attention steps, intermediate results transfer to the LPX rack for feed-forward execution, then return to the GPU for the next attention step. The combined system claims 35x higher inference throughput per megawatt compared to GB200 NVL72. [5]

Comparison

| Approach | Prefill | Decode | What transfers | Benefit | Drawback |
| --- | --- | --- | --- | --- | --- |
| Software only | B200 GPU pool | B200 GPU pool | KV cache (once) | No new hardware; scale pools independently | Both pools use expensive HBM GPUs |
| Rubin CPX (shelved) | Rubin CPX (GDDR7) | Rubin R200 (HBM) | KV cache (once) | Cheaper, lower-power prefill chips | Cancelled; regular GPUs handle prefill fine |
| AWS + Cerebras | Trainium (HBM) | Cerebras CS-3 (SRAM) | KV cache (once, over EFA) | 21 PB/s decode bandwidth, 1,200 tok/s | 44 GB SRAM limits model size per chip |
| SambaNova | GPU (HBM) | SN50 RDU (SRAM + HBM + DDR) | KV cache (once) | Three memory tiers, right memory for each job | Smaller ecosystem than GPU-only |
| NVIDIA + Groq (AFD) | Rubin GPU (HBM) | Rubin GPU (attention) + LP30 LPU (FFN) | Intermediate results (per layer) | 150 TB/s per chip for FFN; 35x throughput/MW | Transfers at every layer; 500 MB SRAM per chip |

The network in between

Every form of splitting workloads adds a networking cost. In prefill-decode splits, the KV cache must transfer from the prefill pool to the decode pool. In attention-FFN splits, intermediate results bounce between the GPU and LPU at every layer. The performance gains from specialization have to be large enough to justify this transfer overhead.

For prefill-decode disaggregation, the KV cache is the main concern. It grows linearly with sequence length and model size. For a large model processing a long prompt, the cache can reach tens of gigabytes per request. Research found that KV cache transfer can account for up to 42% of total job completion time when network bandwidth is limited. [19]
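The ideal-case transfer time is just cache size over link speed, which shows why fabric bandwidth matters so much (illustrative sizes, no protocol overhead):

```python
def kv_transfer_ms(kv_gb, link_gbps):
    """Ideal time to move a KV cache over a network link (no overhead)."""
    return kv_gb * 8 / link_gbps * 1000  # GB -> gigabits, seconds -> ms

print(f"{kv_transfer_ms(10, 400):.0f} ms")  # 200 ms on a 400 Gbps fabric
print(f"{kv_transfer_ms(10, 25):.0f} ms")   # 3200 ms on a 25 Gbps cloud NIC
```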

For attention-FFN disaggregation, the transfers are smaller (just the intermediate results for a single token) but happen much more frequently: once per layer, per token. NVIDIA's LPX racks address this with 640 TB/s of chip-to-chip interconnect within the rack. [5]

The mitigation strategies for KV cache transfer:

  • Co-locate within the same rack. Put the prefill and decode chips physically next to each other so the KV cache stays on high-speed intra-node links (NVLink, PCIe) rather than crossing the data center network. [1]
  • High-bandwidth fabric. A fabric is the network connecting chips within and across servers. AWS uses EFA (Elastic Fabric Adapter) for the Trainium-to-Cerebras link. Production systems targeting 400 Gbps or higher per node see manageable overhead. [7]
  • KV cache compression and tiered storage. DeepSeek's 3FS file system aggregates thousands of SSDs across hundreds of storage nodes as a shared KV cache layer. LMCache and MoonCake manage KV cache across host DRAM (the server's main memory), NVMe SSDs (fast solid-state drives connected directly to the processor), and GPU HBM. [3]
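The tiering idea in the last bullet reduces to a lookup that checks the fastest tier first and promotes hits upward. A generic sketch of that pattern, not LMCache's or MoonCake's actual API, with plain dicts standing in for HBM, DRAM, and SSD:

```python
def lookup_kv(key, tiers):
    """Check tiers fastest-first; on a hit, promote the entry to tier 0."""
    for i, tier in enumerate(tiers):
        if key in tier:
            entry = tier.pop(key) if i > 0 else tier[key]
            tiers[0][key] = entry  # hot caches migrate toward HBM
            return entry
    return None  # miss: prefill must recompute the cache

hbm, dram, ssd = {}, {}, {"prompt-42": "kv-blob"}
print(lookup_kv("prompt-42", [hbm, dram, ssd]))  # 'kv-blob', now resident in hbm
```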

On lower-bandwidth networks (common on cheaper cloud GPU instances), the transfer overhead can erase the gains, especially for shorter prompts. Disaggregation needs fast networking to work. [13][19]

The trade-off across all forms of disaggregation is infrastructure complexity. More pools, more chip types, more routing logic, more things to schedule. Frameworks like NVIDIA Dynamo and llm-d handle much of this, but it is still more to run and maintain. Large deployments benefit enough to make it worth the trouble. Smaller ones are better off running everything on the same GPUs.

References

  1. Zhong, Y. et al., "DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving," OSDI (2024)
  2. NVIDIA Technical Blog, "Introducing NVIDIA Dynamo: A Low-Latency Distributed Inference Framework for Scaling Reasoning AI Models" (2025)
  3. Hao AI Lab, "Disaggregated Inference: 18 Months Later" (2025)
  4. SemiAnalysis, "NVIDIA: The Inference Kingdom Expands" GTC 2026 Analysis (2026)
  5. NVIDIA Technical Blog, "Inside NVIDIA Groq 3 LPX: The Low-Latency Inference Accelerator for the NVIDIA Vera Rubin Platform" (2026)
  6. Tom's Hardware, "Nvidia removes Rubin CPX accelerators from its roadmap" (2026)
  7. AWS and Cerebras, "Collaboration Aims to Set a New Standard for AI Inference Speed and Performance in the Cloud" (March 2026)
  8. Cerebras, "The GPU Is Being Split in Half" (2026)
  9. LMSYS Blog, "Deploying DeepSeek with PD Disaggregation and Large-Scale Expert Parallelism on 96 H100 GPUs" (2025)
  10. Groq, "LPU Architecture" (accessed April 2026)
  11. Cerebras, "Wafer-Scale Engine 3" Product Page (accessed April 2026)
  12. vLLM Blog, "Large Scale Serving: DeepSeek @ 2.2k tok/s/H200 with Wide-EP" (2025)
  13. FlowKV, "A Disaggregated Inference Framework with Low-Latency KV Cache Transfer and Load-Aware Scheduling" (2025)
  14. NVIDIA Newsroom, "NVIDIA Unveils Rubin CPX: A New Class of GPU Designed for Massive-Context Inference" (September 2025)
  15. NVIDIA, "NVIDIA B200 Tensor Core GPU" Datasheet (2024)
  16. SemiAnalysis, "Another Giant Leap: The Rubin CPX Specialized Accelerator Rack" (2025)
  17. SambaNova, "Solving the Decode Bottleneck: Why Agentic Inference Needs Hybrid Hardware" (2026)
  18. LMSYS Blog, "Deploying DeepSeek on GB200 NVL72 with PD and Large Scale EP (Part II): 3.8x Prefill, 4.8x Decode Throughput" (2025)
  19. Xu, K. et al., "HACK: Homomorphic Acceleration via Compression of the Key-Value Cache for Disaggregated LLM Inference" (2025)

Frequently Asked Questions

What is disaggregated inference?

Disaggregated inference splits LLM serving into two phases, prefill and decode, and runs them on separate hardware pools. Prefill processes the input prompt and is compute-bound. Decode generates output tokens one at a time and is memory-bound. NVIDIA, Groq, Cerebras, and AWS are building systems that separate these phases so each cluster can be tuned for its specific bottleneck, promising higher throughput and lower cost per token.

Why is LLM decode memory-bound?

Decode generates one token at a time. Each token requires reading the model's full weight matrix and the accumulated KV cache, but performs relatively little math per byte loaded. The bottleneck is how fast data moves from memory to compute units, not how fast the compute units operate. The H200 proved this: same chip as the H100, only the memory was upgraded, and Llama 2 70B inference sped up 1.9x.

What hardware is used for disaggregated decode?

SRAM-based chips from Groq and Cerebras are purpose-built for decode. Groq's LP30 has 500 MB of SRAM at 150 TB/s bandwidth per chip — orders of magnitude more bandwidth per byte of storage than HBM. Cerebras' WSE-3 has 44 GB of on-chip SRAM at 21 PB/s. Standard HBM GPUs like the B200 also serve as decode hardware in software-level disaggregation.

What is the KV cache transfer problem in disaggregated inference?

When prefill and decode run on separate hardware, the KV cache (intermediate values computed during prefill) must transfer over the network to the decode cluster. This transfer can account for up to 42% of total job completion time on low-bandwidth networks. Production systems mitigate this with high-bandwidth fabrics (400 Gbps+), co-locating prefill and decode within the same rack, or using tiered KV cache storage.
