
KV Cache: The Hidden Memory Wall in LLM Inference

Why long context is expensive: the math of the KV cache, the architectural moves that shrink it (MQA, GQA, MLA, paged attention), and why 1M-context models are an engineering problem before they're a research problem.

S5 Labs Team May 12, 2026

A 70-billion-parameter model with a 100,000-token prompt does not run out of compute. It runs out of memory. Specifically, with standard dense multi-head attention, a single such request asks the GPU to hold roughly 262 GB of cached keys and values — more than three times the 80 GB of memory on an H100. Push the same request out to a 1-million-token context and the cache balloons to 2.6 TB. The model weights themselves, at FP16, are a comparatively quaint 140 GB and they do not grow with the conversation.

This is the KV cache, and it is the single most important reason “long context” has been hard. Almost every architectural innovation of the past three years that gets sold as a context-length breakthrough — multi-query attention, grouped-query attention, multi-head latent attention, sliding windows, paged attention, KV quantization — is in fact an engineering response to this one piece of memory state. The math is plain. The consequences run from API pricing to GPU procurement.

To be fair to mainstream 70B deployments: Llama 3 70B does not actually pay the 262 GB bill. It uses grouped-query attention with 8 KV heads instead of 64, which cuts the cache by a factor of eight. The 100K-context cost there is closer to 33 GB — still enormous, just survivable. We’ll come back to why GQA became standard. First, the loop that makes the cache necessary in the first place.

The autoregressive loop, in one paragraph

A decoder-only transformer generates text one token at a time. For the next token, the model needs to attend back to every prior token in the sequence — that’s how “the cat sat on the” can predict “mat.” The naive thing to do is recompute, on every generation step, the keys (K) and values (V) for the whole prior sequence from scratch. That is an O(n²) mountain of wasted work, because those K and V tensors don’t change once a token has been seen. The fix, whose name writes itself once you see the problem, is to cache them.

What the KV cache actually is

Inside each attention layer, every token gets projected into three tensors: a query Q, a key K, and a value V. The query asks “what should I attend to?”; the keys and values together represent what each prior token has to offer. The K and V tensors for a given token, at a given layer, are pure functions of that token’s content and position — they do not depend on what comes after.

So the cache, mechanically, is simple. For each layer, for each KV head, for every token already in the sequence, you keep the K and V vectors in GPU memory. When the next token arrives, the model only needs to compute its own Q, K, and V, then attend the new Q against the entire stored K and V tensors. The new K and V get appended to the cache for the next round. No prior work is repeated. The cost is that you are now paying memory linearly in sequence length and concurrency.

A practical illustration:

prefill (prompt of N tokens):
  compute K, V for all N tokens, all layers, all KV heads → store in cache

decode (one new token per step):
  compute Q, K, V for current token only
  attend Q against stored K, V (size grows by one each step)
  append new K, V to the cache
  emit token, repeat
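
For concreteness, here is that loop as a minimal single-layer, single-head NumPy sketch. The random weights and the reuse of the output as the next input are stand-ins for a real model; the point is that prefill fills the cache once and every decode step only appends to it.

import numpy as np

d = 64                                        # head dimension (toy size)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

def attend(q, K, V):
    # one query against every cached key/value: softmax(q @ K.T / sqrt(d)) @ V
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ V

# prefill: project the whole prompt once, store K and V
prompt = rng.standard_normal((10, d))         # 10 stand-in token embeddings
K_cache = prompt @ Wk                         # stays resident from here on
V_cache = prompt @ Wv

# decode: compute Q/K/V for the current token only, read the whole cache
x = prompt[-1]
for _ in range(5):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    out = attend(q, K_cache, V_cache)         # no prior K/V is ever recomputed
    K_cache = np.vstack([K_cache, k[None]])   # cache grows one row per step
    V_cache = np.vstack([V_cache, v[None]])
    x = out                                   # stand-in for the next token's embedding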

The math of cache size

The per-token cache cost has a clean closed form. For a decoder-only model:

\text{KV bytes per token} = 2 \cdot L \cdot n_{kv} \cdot d_{head} \cdot \text{bytes per element}

In code form: KV_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_param

The factor of 2 is the K and V tensors. L is the number of transformer layers. n_kv is the number of KV heads — not necessarily the number of attention heads, which is the GQA wrinkle we’ll get to. d_head is the per-head dimension. The element size is 2 bytes for FP16 or BF16, 1 for INT8, half a byte for INT4.

Plug in a Llama-style 70B model with standard multi-head attention: 80 layers, 64 heads, 128 dim per head, FP16.

2 \cdot 80 \cdot 64 \cdot 128 \cdot 2 = 2{,}621{,}440 \text{ bytes} \approx 2.62 \text{ MB per token}

That’s 2.62 MB for a single token. Multiply by the context length and you get the per-request bill:

Context    Dense MHA (n_kv = 64)    GQA (n_kv = 8)
8K         21.0 GB                  2.62 GB
32K        83.9 GB                  10.5 GB
128K       335.5 GB                 41.9 GB
1M         2.62 TB                  328 GB

The second column is the version a real Llama 3 70B deployment pays, because Llama 3 collapses 64 attention heads down to 8 KV heads (eight queries per shared K/V group). Everything in this article hinges on that ratio.

For the rest of the piece, assume FP16 and a 70B-class architecture unless stated otherwise. The key point is not the exact number but that the term grows linearly in context length, linearly in batch size, and proportionally in whatever KV-sharing scheme the architecture chose.
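
The formula is short enough to keep next to any capacity plan. Here is a minimal sketch that reproduces the table above; note the table (like most advertised context lengths) uses decimal thousands, and the two configurations are the hypothetical dense-MHA 70B and Llama 3 70B as shipped.

def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # the factor of 2 is one K tensor and one V tensor per layer per KV head
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

def kv_cache_gb(context_len, **cfg):
    return kv_bytes_per_token(**cfg) * context_len / 1e9

mha = dict(n_layers=80, n_kv_heads=64, head_dim=128)   # hypothetical dense-MHA 70B
gqa = dict(n_layers=80, n_kv_heads=8, head_dim=128)    # Llama 3 70B as shipped

for ctx in (8_000, 32_000, 128_000, 1_000_000):
    print(f"{ctx:>9} tokens: MHA {kv_cache_gb(ctx, **mha):7.1f} GB   "
          f"GQA {kv_cache_gb(ctx, **gqa):6.1f} GB")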

Why this is the memory wall

The model weights for a 70B FP16 model are about 140 GB — and that’s it. The weights don’t change between requests, they don’t grow with the conversation, and they can be shared across every user the GPU is serving. They’re a one-time cost.

The KV cache is the opposite. It scales with every extra token in every concurrent request. Two users with 100K-token contexts on a dense MHA 70B model would notionally need over half a terabyte of cache between them. Even on Llama 3 70B’s GQA, fourteen simultaneous 32K-context requests cost more memory in cache (~147 GB) than the model weights themselves.

This is why “memory wall” is a fairer description of long-context economics than “compute wall.” Compute scales with FLOPs, which scale with how much the GPU is working. The cache scales with how much state the GPU is holding still — and on an H100 with 80 GB of HBM3 and 3.35 TB/s of bandwidth, holding state is the constraint that bites first. To be precise, the bottleneck is sharpest during decode, where each new token requires loading the entire growing cache to attend against. Prefill is more compute-bound; high-concurrency long-context serving is where the cache makes itself felt.

The architectural fixes

There are two distinct categories of intervention here, and conflating them is the single most common mistake in this space. Some techniques reduce the size of the cache itself; others manage a fixed cache more efficiently. Both matter. They are not interchangeable.

Techniques that shrink the cache

MQA — Multi-Query Attention. Shazeer’s 2019 paper named the bottleneck explicitly as “the memory-bandwidth cost of repeatedly loading the large ‘keys’ and ‘values’ tensors” and proposed the simplest possible fix: every attention head shares a single K and V. With one KV head instead of h, the cache shrinks by a factor of h. The cost is quality — the paper itself reports a small but real degradation, and follow-up benchmark work confirms it isn’t free for all tasks. Use it where decode latency is everything and a quality regression of a percentage point is acceptable.

GQA — Grouped-Query Attention. Ainslie et al., 2023 split the difference. Instead of one shared K/V or h of them, you have g groups of heads, each sharing a K/V. With h = 64 and g = 8 — the Llama 3 70B configuration — you get an 8× cache reduction and quality almost indistinguishable from full MHA after a short uptrain. This is why GQA, not MQA, became the default. It is the part of Llama 2, Llama 3, Mistral, and most production-grade 70B-class models that does the most quiet work.
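
The mechanism is easy to sketch. With toy shapes matching the Llama 3 70B head counts: only n_kv K/V heads are computed and cached, and each is shared across a group of h / n_kv query heads at attention time. Setting n_kv = 1 recovers MQA; n_kv = h recovers full MHA.

import numpy as np

h, n_kv, d_head, n = 64, 8, 128, 256       # heads, KV heads, head dim, toy length
group = h // n_kv                          # 8 query heads per shared K/V head

Q = np.random.randn(h, n, d_head)
K = np.random.randn(n_kv, n, d_head)       # only these two tensors are cached
V = np.random.randn(n_kv, n, d_head)

# Expand K/V across query groups at attention time. np.repeat copies for
# clarity; a fused kernel would index the shared heads instead of copying.
K_exp = np.repeat(K, group, axis=0)        # (h, n, d_head)
V_exp = np.repeat(V, group, axis=0)

scores = np.einsum('hqd,hkd->hqk', Q, K_exp) / np.sqrt(d_head)
# softmax and causal mask omitted; the cached footprint is K/V, not K_exp/V_exp
print(f"cache ratio vs MHA: {K.nbytes / K_exp.nbytes:.3f}")   # 0.125, i.e. 8x smaller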

MLA — Multi-head Latent Attention. DeepSeek-V2 introduced MLA, which compresses K and V into a low-rank latent representation of dimension d_c per token (roughly 4 × d_head in V2), then reconstructs heads from that latent at attention time. The V2 paper reports a 93.3% KV-cache reduction and 5.76× maximum generation throughput against its baseline. Those numbers are paper-specific to the V2 comparison setup — don’t extrapolate them as a universal MLA constant — but the qualitative point holds: latent compression is the deepest cut into per-token bytes that current production models actually use.
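
In spirit, MLA looks like the sketch below: cache one low-rank latent per token, then rebuild per-head K and V from it at attention time. This is heavily simplified (the real design also carries a decoupled positional component, and every name here is made up for illustration), but the bytes-per-token arithmetic is visible.

import numpy as np

h, d_head, d_model = 32, 128, 4096         # toy head count and dims
d_c = 4 * d_head                           # latent dim, roughly V2's ratio
rng = np.random.default_rng(0)

W_down = rng.standard_normal((d_model, d_c)) * 0.02    # compress to latent
W_uk = rng.standard_normal((h, d_c, d_head)) * 0.02    # rebuild K per head
W_uv = rng.standard_normal((h, d_c, d_head)) * 0.02    # rebuild V per head

x = rng.standard_normal(d_model)           # one token's hidden state

c = x @ W_down                             # (d_c,)  the only thing cached
k = np.einsum('c,hcd->hd', c, W_uk)        # per-head K, rebuilt on the fly
v = np.einsum('c,hcd->hd', c, W_uv)        # per-head V, rebuilt on the fly

mha_bytes = 2 * h * d_head * 2             # dense MHA: K and V, all heads, FP16
mla_bytes = d_c * 2                        # MLA: just the latent, FP16
print(f"{mha_bytes / mla_bytes:.0f}x fewer cached bytes per layer")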

Sliding window attention. Used in Mistral 7B and several follow-ups. Each token only attends to the prior W tokens (Mistral 7B uses W = 4096), and tokens older than W are evicted from the cache. The cache becomes O(W) instead of O(n). The widely repeated “but the model forgets everything beyond the window” framing is wrong: because attention is stacked across layers, information from earlier tokens propagates through the residual stream layer by layer, so tokens outside the window can still influence the next prediction — just indirectly. The honest cost is that hard long-range dependencies, like answering a question about something explicitly mentioned 100K tokens earlier, degrade.
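
Mechanically, the cache under a sliding window is a fixed-size rolling buffer; the Mistral 7B paper describes exactly this as its rolling buffer cache. A sketch:

import numpy as np

W, d_head = 4096, 128                      # Mistral-7B-like window size

K_buf = np.zeros((W, d_head), dtype=np.float16)   # fixed allocation: O(W), not O(n)
V_buf = np.zeros((W, d_head), dtype=np.float16)

def cache_token(pos, k, v):
    # the token at absolute position pos overwrites slot pos % W,
    # silently evicting whatever was written W steps earlier
    slot = pos % W
    K_buf[slot], V_buf[slot] = k, v

for pos in range(100_000):                 # 100K tokens, memory never grows
    cache_token(pos, np.ones(d_head), np.ones(d_head))
print(K_buf.nbytes + V_buf.nbytes, "bytes of cache, at any sequence length")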

KV cache quantization. Storing K and V in INT8 or INT4 instead of FP16 is a straightforward 2–4× memory savings. Results from KIVI report 2.6× less peak memory, up to 4× larger batch sizes, and 2.35–3.47× throughput at near-unchanged quality. More aggressive work like KVQuant pushes to 3-bit storage with under 0.1 perplexity degradation on the tested models, and TurboQuant (ICLR 2026) claims quality-neutral 3.5-bit storage. The practical caveat is workload-specific: quantized KV holds up well on QA and perplexity benchmarks but needs validation on whatever workload you actually plan to run.
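
The simplest version of the idea is per-vector absmax quantization with a stored scale. KIVI and KVQuant are considerably smarter about outlier channels and grouping, so treat this only as a sketch of the memory arithmetic:

import numpy as np

def quantize_kv(x):
    # per-vector absmax: one FP16 scale plus an INT8 payload
    scale = np.abs(x).max() / 127.0
    return np.round(x / scale).astype(np.int8), np.float16(scale)

def dequantize_kv(q, scale):
    return q.astype(np.float32) * np.float32(scale)

k = np.random.randn(128).astype(np.float32)        # one head's K vector
q, s = quantize_kv(k)

print(q.nbytes + s.nbytes, "bytes vs", k.astype(np.float16).nbytes)  # 130 vs 256
print(f"max abs error: {np.abs(k - dequantize_kv(q, s)).max():.4f}")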

Techniques that manage the cache better

Paged Attention (vLLM). Kwon et al., 2023 treated the KV cache like an operating-system page table: instead of allocating a contiguous block of GPU memory for each request’s worst-case context length, the cache is split into fixed-size blocks that get allocated on demand and freed when a request finishes. The paper reports near-zero waste in KV-cache memory and 2–4× throughput at similar latency — purely from eliminating fragmentation and enabling block-level sharing between requests that share a prefix (e.g., a common system prompt). PagedAttention does not change the cache at the model level; the same logical cache is just managed less wastefully.
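
The core bookkeeping is small enough to sketch: a pool of fixed-size physical blocks, a free list, and a per-request block table. Names and structure below are illustrative rather than vLLM’s actual code, though the default block size of 16 tokens is real.

BLOCK = 16                                   # tokens per physical block

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # physical blocks not yet in use
        self.tables = {}                     # request id -> list of block ids
        self.lengths = {}                    # request id -> tokens stored

    def append_token(self, req):
        # reserve room for one more token; allocate a block only on demand
        n = self.lengths.get(req, 0)
        if n % BLOCK == 0:                   # current block full, or first token
            self.tables.setdefault(req, []).append(self.free.pop())
        self.lengths[req] = n + 1

    def finish(self, req):
        # request done: blocks go straight back to the pool
        self.free += self.tables.pop(req, [])
        self.lengths.pop(req, None)

cache = PagedKVCache(num_blocks=1024)
for _ in range(100):
    cache.append_token("req-A")
print(len(cache.tables["req-A"]), "blocks for 100 tokens; waste is under", BLOCK)
cache.finish("req-A")

Contrast with contiguous allocation, which has to reserve the worst-case context length up front for every request; the near-zero waste falls out of allocating in 16-token steps and freeing immediately.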

Prompt caching (Anthropic, OpenAI APIs). This is the same idea, exposed at the API surface. When a request shares a prefix with an earlier request, the provider can reuse the cached K and V tensors from that prefix instead of recomputing them. OpenAI’s extended prompt caching documentation describes it as “offloading the key/value tensors to GPU-local storage,” with a 1024-token minimum and exact-prefix matching. Anthropic’s docs make the economics explicit: cache writes carry a markup (1.25× or 2× the base input price depending on TTL), but cache reads are charged at roughly 10% of the base. That asymmetric pricing is exactly what KV reuse looks like in your invoice.

Two takeaways. First: paged attention and prompt caching are not shrinking the cache, they are sharing or reusing it. Stack them with MQA/GQA/MLA/quantization for the multiplier. Second: prompt caching is only economical if you actually get cache hits — repetitive system prompts, long retrieved documents reused across queries, agent loops with stable scaffolding. One-shot conversations pay the cache-write markup with no return.
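
The break-even arithmetic for prompt caching is worth running once. Using Anthropic’s published 5-minute-TTL multipliers (1.25× base to write, 0.1× base to read), a cached prefix already pays for itself on its second use:

def prefix_cost_ratio(uses, write_mult=1.25, read_mult=0.10):
    # relative cost of caching a prefix vs. resending it uncached `uses` times;
    # 1.0 is the base input-token rate
    cached = write_mult + (uses - 1) * read_mult
    return cached / (uses * 1.0)

for n in (1, 2, 5, 20):
    print(f"{n:>2} uses: cached prefix costs {prefix_cost_ratio(n):.0%} of uncached")
# 1 use: 125% (pure markup), 2 uses: 68%, 5 uses: 33%, 20 uses: 16%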

The DeepSeek line, and what V4 is really doing

DeepSeek’s KV-cache work has been the clearest public arc on this problem. V2 introduced MLA, which sets the floor at roughly 10× compression over dense MHA. V3 kept MLA and pushed the routing side of the problem. V4, announced in early 2026, layered on CSA (Compressed Sparse Attention) and HCA (Hybrid Compressed Attention) — both, in essence, are about making the access pattern over the latent cache sparse and structured, not just making the cache smaller.

The headline number coming out of the V4 disclosure — “27% of V3.2 per-token FLOPs and 10% of V3.2 KV cache at 1M context” — is, almost in its entirety, KV-cache and attention-pattern engineering. The decoder is doing similar work to V3 on most tokens; it’s just doing far less of it against the past. We’ll have a dedicated CSA/HCA breakdown in a separate technical article, but the broad point matters here: the frontier of “long context” research at the architecture level is now overwhelmingly the frontier of how cleverly you can avoid touching the cache.

The hardware side is making the same bet. Google’s TPU v8i disclosure at Cloud Next 2026 emphasized 3× more on-chip SRAM alongside larger HBM, with explicit framing around keeping more of the KV cache on silicon rather than reaching out to HBM for every decode step. When a major chip vendor tells you their next inference accelerator has 3× the SRAM, the unspoken second half of the sentence is “because of the KV cache.”

Practical implications for builders

The math turns into a few concrete consequences worth keeping front of mind.

Concurrency math on H100s. An 80 GB H100 has to fit the model weights and the KV cache for every concurrent request, plus some headroom for activations and overhead. On a Llama 3 70B with FP16 weights (~140 GB), you don’t even fit the model on one H100 — you need at least two, leaving roughly 20 GB of working memory across the pair after weights. At a GQA cache cost of ~33 GB per 100K-context request, that leftover doesn’t fit even one such request. Drop to 32K context (10.5 GB per request) and you fit one or two. Quantize the weights to INT4 (~35 GB) and the KV cache to INT8, and the picture changes dramatically. This kind of back-of-envelope math should run before a capacity plan, not after a benchmark surprises you.
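
The same arithmetic as a script, so the assumptions are explicit. All figures are the rough ones used above (FP16 weights, the GQA cache numbers from the table), not measured values, and headroom_gb is a knob you should set above zero for anything beyond a back-of-envelope pass.

def concurrent_requests(gpu_gb, n_gpus, weights_gb, kv_gb_per_request,
                        headroom_gb=0.0):
    # how many requests of a given context fit once weights are resident
    usable = gpu_gb * n_gpus - weights_gb - headroom_gb
    return max(usable, 0.0) / kv_gb_per_request

print(f"{concurrent_requests(80, 2, 140, 32.8):.1f}")   # 100K ctx, FP16 weights, 2x H100: 0.6
print(f"{concurrent_requests(80, 2, 140, 10.5):.1f}")   # 32K ctx, FP16 weights, 2x H100: 1.9
print(f"{concurrent_requests(80, 1, 35, 5.25):.1f}")    # 32K ctx, INT4 weights + INT8 KV: 8.6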

1M-context pricing isn’t linear, and it shouldn’t be. Providers charging a flat per-token rate at extreme context lengths are eating a higher marginal cost the further out you go — every additional token in the prompt costs not just the per-token compute but the proportional KV-cache memory and the bandwidth to move it. Expect tiered pricing, expect long-context surcharges, and expect the gap between advertised context windows and cost-effective context windows to widen.

Prompt caching pays off when prefixes are stable and long. The 10× discount on cache reads is real, and on workflows with repeated long system prompts or retrieved documents that recur across requests, it can dominate cost. On one-off queries, the cache-write markup is dead money. The mental model is: prompt caching is just KV reuse exposed at the API layer — you’re paying once to populate, then renting cheap reads.

Local 7B models with 128K context are slower than you’d expect, and the cache is why. A 7B model is small enough that a single consumer GPU should run it briskly, but at 128K context the KV cache (even with GQA and quantization) becomes the dominant memory term, and bandwidth, not FLOPs, governs decode speed. The model isn’t slow; the GPU is spending most of its decode time reading state.

What’s next

Three directions are visibly displacing pure cache management.

Block-sparse and hardware-aligned attention. Native Sparse Attention and follow-up work redesign attention itself to read sparse, hardware-aligned slices of the cache instead of attending densely over the full sequence. The cache exists; the model just stops touching most of it. This is the most likely near-term replacement for sliding windows.

Retrieval-augmented context as a KV-cache alternative. A 1M-token prompt is a brute-force way to give a model access to a large corpus. A well-tuned RAG pipeline that retrieves the right 8K tokens and stuffs them into a 32K context is almost always cheaper and often more accurate. The trade-off is real: retrieval introduces its own error surface, and benchmarks like LongBench v2 make clear that retrieval-style success and true long-context reasoning success diverge as length grows. Treat RAG as the right call for knowledge access, not as a full substitute for long-range reasoning.

State-space models that side-step the wall entirely. Mamba and its descendants replace attention with a recurrent state that has constant memory per step regardless of sequence length. There is no KV cache to grow because there is no attention over the past — there’s a fixed-size compressed state that gets updated each token. The trade-off is harder global recall, but for long-form serialized workloads (audio, video, long code), the cost profile is qualitatively different. Several frontier labs are training SSM-attention hybrids that try to keep attention’s reasoning quality while inheriting Mamba’s flat memory profile.
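
The contrast with a KV cache is easiest to see in code. Below is a toy linear state-space recurrence, not Mamba’s selective mechanism: a fixed-size state is updated once per token, so the memory held for the past is constant no matter how long the sequence runs.

import numpy as np

d_state, d_model = 16, 64
rng = np.random.default_rng(0)
A = rng.uniform(0.9, 0.99, d_state)          # per-channel decay (toy values)
B = rng.standard_normal((d_state, d_model)) * 0.1
C = rng.standard_normal((d_model, d_state)) * 0.1

state = np.zeros(d_state)                    # the entire memory of the past
for t in range(10_000):
    x = rng.standard_normal(d_model)         # stand-in token embedding
    state = A * state + B @ x                # O(1) memory per step, forever
    y = C @ state                            # output for this step
print(state.nbytes, "bytes of sequence state, regardless of length")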

The same memory term shows up on the parameter side too, where mixture-of-experts architectures decouple total parameters from activated parameters. Long-context inference is, increasingly, two memory problems running at once: the KV cache (state) and the expert weights (parameters). The frontier labs are solving both at the same time.

The deeper point — older than KV caching, older than transformers, older than deep learning — is that the bottleneck migrates. A decade ago, compute was scarce and memory was plentiful relative to model size. Today, FLOPs are cheap and what you can’t get is bandwidth to a large enough cache fast enough. Whatever the next architecture looks like, it will be designed against the bandwidth and the memory it needs to not touch. Long context is, in the end, less a model problem than a memory-system problem. Build accordingly.

Sources

  • Vaswani et al., “Attention Is All You Need.” arXiv:1706.03762
  • Shazeer, “Fast Transformer Decoding: One Write-Head Is All You Need.” arXiv:1911.02150
  • Ainslie et al., “GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.” arXiv:2305.13245
  • DeepSeek-AI, “DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model.” arXiv:2405.04434
  • Jiang et al., “Mistral 7B.” arXiv:2310.06825
  • Kwon et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention.” arXiv:2309.06180
  • Liu et al., “KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache.” arXiv:2402.02750
  • Hooper et al., “KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization.” arXiv:2401.18079
  • Yuan et al., “Native Sparse Attention.” arXiv:2502.11089
  • Gu and Dao, “Mamba: Linear-Time Sequence Modeling with Selective State Spaces.” arXiv:2312.00752
  • Bai et al., “LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks.” arXiv:2412.15204
  • Anthropic, “Prompt caching.” docs.anthropic.com
  • OpenAI, “Prompt caching.” platform.openai.com
