
Compressed Sparse Attention: How DeepSeek V4 Reached 1M Context at 27% of the FLOPs

DeepSeek V4 hits 1M context at 27% of V3.2's per-token compute. How Compressed Sparse Attention and Heavily Compressed Attention combine to do it.

S5 Labs Team May 13, 2026

DeepSeek’s V4 technical report makes a specific, falsifiable claim: at a 1-million-token context, V4-Pro performs a single decode step at 27% of the per-token FLOPs of V3.2, and uses 10% of the KV cache. V4-Flash, the smaller sibling, lands at 10% of the FLOPs and 7% of the cache. The headline number that floated through press coverage — “frontier-class at a tenth the cost” — is, almost entirely, this number. It is not pretraining efficiency, not chip-level wizardry, not a new MoE routing trick. It is what the attention layer is doing differently.

This article unpacks that “differently.” The mechanism has two parts, which DeepSeek calls Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). Press coverage and the V4 preview page sometimes refer to the package as “token-wise compression plus DSA”; the technical report breaks it cleanly into the two named components, and we’ll use the report’s terminology throughout. The lineage runs MLA → DSA → CSA/HCA across four DeepSeek releases, and the story of V4 is not one clever idea but the third pass at a problem the lab has been compounding on since 2024.

We assume the reader knows what attention is. If you want the prerequisites, the foundations of transformer reasoning article covers the standard mechanism and the KV cache article covers the memory side of long-context inference. CSA/HCA is what you do once both of those constraints have been named.

Two costs, not one

Long-context inference has two enemies, and most prior “efficient attention” work attacks only one.

The first is compute. Standard attention is

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

The QKᵀ term is the expensive one. For a sequence of length n, it produces an n × n similarity matrix — O(n²·d) work at the layer level. At 8K context, that’s manageable. At 1M, it is the dominant term in the entire forward pass.

The second is memory. For autoregressive decoding, every prior token’s K and V vectors must stay in GPU memory so the next token can attend back. That cost is O(n) per request per layer, and on a 70B-class model it dominates GPU memory well before n approaches 1M. The KV cache article walks the math.
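Both costs are visible in a minimal single-head NumPy sketch: the (n, n) scores matrix is the O(n²·d) compute term, and the K and V arrays that must persist across decode steps are the O(n) cache.

```python
import numpy as np

def dense_attention(Q, K, V):
    """Standard causal scaled dot-product attention for one head.

    Q, K, V: (n, d). The scores matrix is (n, n), which is the
    O(n^2 * d) term that dominates at long context.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)          # (n, n): quadratic in sequence length
    # Causal mask: token i may only attend to tokens <= i.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                     # (n, d)

n, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = dense_attention(Q, K, V)
print(out.shape)                           # (1024, 64)

# Memory side: decoding must keep K and V for every prior token,
# so the per-layer, per-request cache grows linearly with n.
kv_cache_floats = 2 * n * d
```

Scaling n from 1,024 to 1M multiplies the scores matrix by roughly a million, which is why the quadratic term, harmless here, becomes the whole story at long context.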

Sparse-attention research from 2019–2021 — Sparse Transformer, Longformer, BigBird — attacked the compute side by making each query attend to only a fixed subset of keys, with patterns like local windows plus a few global tokens. Linear-attention work — Linformer, Performer — attacked it by approximating softmax with low-rank or kernel-based reductions. Neither line reduced the cache; both struggled to match dense attention on long-range reasoning benchmarks once you trained at scale.

DeepSeek’s contribution across V2, V3.2, and V4 is that the lab has, in three steps, attacked both costs in a way that survives full-scale pretraining and frontier-competitive downstream evaluation. CSA/HCA is the third step.

The lineage: MLA → DSA → CSA/HCA

The most common framing error in V4 coverage is to describe it as “MLA but better.” MLA is the V2/V3 mechanism. V4 inherits MLA’s idea — compress K and V into a low-rank latent space — and then layers two new mechanisms on top. Between V3 and V4 sits V3.2, which introduces the sparsity half of the package. Skipping that step makes the rest of the story incoherent.

MLA in V2/V3

Multi-head Latent Attention, introduced in DeepSeek-V2, projects the K and V tensors into a shared low-rank latent of dimension d_c (roughly 4·d_head in V2), caches that latent, and reconstructs per-head K and V at attention time from the cached latent plus a small per-head matrix. The V2 paper reports a 93.3% KV-cache reduction and 5.76× generation throughput against its baseline.
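A minimal sketch of the caching idea, with illustrative dimensions. The 4× latent ratio follows the V2 paper; the projection names, shapes, and random weights below are stand-ins for exposition, not the published factorization.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, n_heads, d_head = 512, 1024, 16, 64
d_c = 4 * d_head   # latent width ~4x one head, per the V2 paper's ratio

# Learned projections (random stand-ins here).
W_down = rng.standard_normal((d_model, d_c)) * 0.02        # shared down-projection
W_uk = rng.standard_normal((n_heads, d_c, d_head)) * 0.02  # per-head K reconstruction
W_uv = rng.standard_normal((n_heads, d_c, d_head)) * 0.02  # per-head V reconstruction

h = rng.standard_normal((n, d_model))   # hidden states for n cached tokens

# What gets cached: one d_c-dim latent per token, not per-head K and V.
latent_cache = h @ W_down               # (n, d_c)

# At attention time, per-head K and V are rebuilt from the shared latent.
K = np.einsum('nc,hcd->hnd', latent_cache, W_uk)   # (n_heads, n, d_head)
V = np.einsum('nc,hcd->hnd', latent_cache, W_uv)

naive_cache = 2 * n * n_heads * d_head  # K and V stored for every head
mla_cache = n * d_c                     # one shared latent per token
print(mla_cache / naive_cache)          # 0.125 with these dims
```

With these toy dimensions the cache shrinks 8×; the reported 93.3% reduction in V2 comes from the production ratio of latent width to total per-head K/V width, which is more aggressive than this sketch.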

The win is purely on cache size. Every cached token still participates in every attention computation — the cache is just smaller per token. Compute per query is unchanged. This is a compression technique, not a sparsity technique.

DSA in V3.2

V3.2-Exp introduced DeepSeek Sparse Attention (DSA). This is the bridge that gets skipped in most press coverage and is the missing rung in the lineage. DSA adds selection to MLA: rather than attending over the entire compressed-latent cache, each query attends only to a learned subset of past tokens. The set is chosen per-query at inference time — not by fixed pattern, but by a small learned router that decides which prior positions are relevant to this position.

This is the conceptually expensive move. Learned sparsity is what fixed-pattern sparse attention (Longformer, BigBird) tried and failed to scale: it does not commit to a topology in advance, and the model gets to spend its sparsity budget on whichever tokens matter for this input. The historical lesson from sparse-attention research is that fixed patterns lose information predictably — long-range dependencies fall outside the pattern, and quality degrades. DSA’s per-query selection avoids that failure mode at the cost of needing a learned selector that itself runs in sub-quadratic time.
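A toy version of per-query selection, assuming nothing about DeepSeek's actual router: the scorer below is a simple projected-dot-product stand-in, there only to show that selection can run in O(n) per query while attention itself drops to O(k).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_c, k = 4096, 128, 64   # cache length, latent dim, sparsity budget

latent_cache = rng.standard_normal((n, d_c))
q = rng.standard_normal(d_c)             # current query, in latent space

# Stand-in selector: a cheap scoring pass over the whole cache. The real
# router is learned; a dot product against a projected query is the
# simplest illustration of "score every position in O(n * d_c)".
W_sel = rng.standard_normal((d_c, d_c)) * 0.02
scores = latent_cache @ (W_sel @ q)      # (n,): linear, not quadratic

topk = np.argpartition(scores, -k)[-k:]  # k selected positions, per query
selected = latent_cache[topk]            # (k, d_c) slice actually attended over

# Attention now costs O(k * d_c) per query instead of O(n * d_c).
attn_logits = selected @ q / np.sqrt(d_c)
w = np.exp(attn_logits - attn_logits.max())
w /= w.sum()
out = w @ selected                       # (d_c,)
```

The part this sketch cannot show is the hard part: `latent_cache[topk]` is a random gather, and making that read as fast as a contiguous one is exactly the hardware-alignment problem the NSA line solves.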

The closest research precursor is Native Sparse Attention (NSA), published in early 2025 by authors overlapping with DeepSeek’s V3.2 team. NSA’s specific contribution was hardware-aligned sparsity: kernels that read structured-sparse slices of the cache as fast as dense kernels read contiguous memory. Without that, learned sparsity is theoretically efficient and practically slower than the dense baseline, because random scatter-gather kills GPU memory bandwidth. NSA is the bridge from “this works on paper” to “this runs on H800s at training scale.”

CSA and HCA in V4

V4 productionizes the lineage. The attention stack is heterogeneous: some layers run Compressed Sparse Attention (CSA), others run Heavily Compressed Attention (HCA).

CSA is essentially DSA with the compression more deeply integrated: token-wise compression of K and V into a latent, learned per-query selection of which compressed entries to attend over, and hardware-aligned kernels inherited from the NSA line. Each CSA query reads a small structured-sparse slice of the cache, both shrunk (compression) and short (sparsity).

HCA is the more aggressive variant. Where CSA reduces compute and memory along both axes, HCA leans hard on compression — the per-token latent is squeezed further, with correspondingly larger reconstruction overhead per attended token. The trade-off is that HCA layers can afford to attend over a denser slice of the past because each entry is so small. CSA layers are sparse but medium-compressed; HCA layers are denser but heavily compressed. Both end up cheaper than dense MLA, and they make different trade-offs in the compute/memory plane.

The hybrid scheduling — which layer runs which — is the part of V4 that does not generalize without the paper’s specific recipe. The published inference config suggests early layers lean denser (the model is still building its representation) and deeper layers lean sparser (most retrieval is local once enough context has been built), but treating this as architectural law would overclaim. Read the V4 report for the exact policy.

A clean way to hold this in your head: MLA shrunk the cache. DSA added selection. CSA combined them. HCA pushed compression further so some layers could afford to be less sparse. Each step inherits and refines the previous.

Unpacking the 27% claim

The “27% of V3.2 FLOPs at 1M context” number is doing a lot of work in the V4 narrative, and it’s worth being precise about where the 73% savings come from.

Attention compute scales as O(n · s · d) per layer, where s is the effective set size each query attends over and d is the effective per-entry dimensionality. For dense full attention, s = n and d = d_head · n_kv (the per-token cache footprint: head dimension times number of KV heads). For DeepSeek’s stack:

Layer type                        | Effective s      | Effective d            | Compute term
Full MHA (hypothetical baseline)  | n                | d_head · n_heads       | O(n² · d_head · n_heads)
MLA (V2/V3)                       | n                | d_c (low-rank latent)  | O(n² · d_c)
CSA (V4)                          | k ≪ n (learned)  | d_c′ (compressed)      | O(n · k · d_c′)
HCA (V4)                          | ~n               | d_c″ ≪ d_c             | O(n² · d_c″)

CSA cuts the dominant n² term to n · k by making each query attend to only k selected positions instead of all n. HCA keeps the n² but slashes the per-entry constant. The interleaving means the model only pays the full n² cost on the heavily-compressed layers — and pays a much smaller n · k cost everywhere else.

This is the structural reason 27% (and not, say, 50% or 5%) is plausible. At 1M context, n · k for sparse layers is roughly k/n times the dense cost — if k ≈ 0.05n, those layers cost about 5% of dense. The HCA layers, attending nearly in full but at much lower d, cost perhaps 20–30% of dense. Weight them across all 61 layers and you land somewhere in the 25–30% band. The exact decomposition depends on the layer schedule, which the V4 paper specifies; we’re walking the order-of-magnitude story, not reproducing the report.
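The weighted sum can be made concrete. The layer split and per-layer cost ratios below are assumptions chosen to land inside the band the text describes, not numbers from the V4 report.

```python
# Back-of-envelope for where a figure near 27% can come from. Every
# number below is an illustrative assumption; the real layer schedule
# and per-layer constants are in the V4 report, not here.

n_layers = 61            # V4's layer count, per the text above
csa_layers = 10          # assumed split between the two layer types
hca_layers = n_layers - csa_layers

csa_cost = 0.05          # n*k vs n^2: k/n ~ 5% of a dense layer's cost
hca_cost = 0.30          # full n^2 but a heavily compressed per-entry d

blended = (csa_layers * csa_cost + hca_layers * hca_cost) / n_layers
print(f"{blended:.0%}")  # 26%, inside the 25-30% band sketched above
```

The structure of the sum is the point: the blended figure is pinned between the cheapest and the most expensive layer type, so any schedule mixing ~5% layers with ~30% layers lands in that band.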

One more important asymmetry: the 27% number is at 1M context. At shorter contexts, the O(n²) term doesn’t dominate, so the savings are smaller. The V4 report’s framing is honest about this — the gain compounds with sequence length, which is precisely the regime DeepSeek is pricing into. At 8K context, V4 is not 4× cheaper than V3.2 per token; it is similar. Long context is where this architecture pays off.

The KV cache side mirrors the compute side. V4-Pro’s 10% cache vs. V3.2 at 1M is the combined effect of (a) tighter latent compression in HCA layers, (b) reduced per-token storage in CSA layers because the compressed representation is smaller, and (c) the absence of any cache duplication across heads (already inherited from MLA). V4-Flash’s 7% is the same recipe with more aggressive layer-level compression.

Why this is hard to do without quality loss

The history of efficient attention is largely a history of papers that beat dense attention on a benchmark in a specific regime and then failed to scale. Linformer, Performer, Reformer, sliding-window attention, Longformer, BigBird — all are useful, none replaced dense attention at frontier scale, and the reason is the same in each case: uniform information loss. A fixed sparsity pattern throws away the same kinds of relationships every time. A low-rank approximation throws away the same kinds of features every time. The model can’t route around the deficit because the deficit is baked into the architecture.

V4’s design responds to this in three ways.

Sparsity is per-input, not architectural. CSA’s selector is a learned, query-conditioned router. If the next token depends on a passage 800K tokens earlier, the selector can route attention there for that query specifically — without having paid quadratic cost to find it, because the selector itself runs sub-quadratically. This is the conceptual generalization of fixed sparsity that earlier work could not implement at scale, and the gap between “should work in principle” and “actually runs at frontier scale” is where most of NSA’s engineering went.

Compression keeps a slow lane. HCA layers are aggressive; CSA layers are less so. Information that would be lost by uniform heavy compression survives in the intermediate-compression layers, where the model can re-expand into richer representations as needed. The architecture has a fast path and a slow path, in the same sense that an out-of-order CPU has bypass networks for hot data.

Training is long-context from the start. Earlier sparse-attention work was often retrofitted onto models pretrained on short contexts, where the sparsity pattern is unstressed and the selector never has to do hard work. V4 trains with long-context curricula throughout, so the selector and the compression both learn under the regime they’ll be deployed in. This sounds obvious in 2026 and was rare in 2020.

None of these is novel in isolation. The novelty is that all three hold together at 1.6T parameters and 1M context — and that, as the V4 report and downstream benchmarks show, the resulting model lands within a few points of GPT-5.5 and Claude Opus 4.7 on most reasoning tasks. The benchmark wins are not universal. V4 is below GPT-5.5 on SWE-Bench Verified (~80.6% vs 82.7%) and below Claude Opus 4.7 on some long-context retrieval tasks. The story is parity-at-fraction-of-cost, not new state of the art. See the V4 cost-disruption note for the full benchmark picture.

The competitive context

Other long-context models in 2026 are not standing still, and it’s worth grounding CSA/HCA against what else is in the field.

Gemini 3 Pro offers 1M context with an undisclosed attention mechanism. Google has not published architecture details for Gemini’s long-context path; the published pricing — $2/$12 per 1M tokens up to 200K, $4/$18 above — suggests internal cost reflects a meaningful long-context surcharge, which is consistent with some form of sparse or compressed attention but is not direct evidence of any specific mechanism.

GPT-5.5 and Claude Opus 4.7 ship shorter context windows (200K–400K) with attention that is, by all available reporting, denser than DeepSeek’s. The prioritization is the opposite: keep attention rich for the quality ceiling, accept that the context window can’t stretch to 1M without prohibitive cost. Anthropic’s extended-context guidance and OpenAI’s tool-augmented workflows both push users toward retrieval and tool use for very large corpora rather than dumping everything into prompt.

Mamba and state-space models sidestep attention entirely, achieving O(n) inference with a constant-size compressed state. Quality has trailed transformer attention on benchmarks requiring precise global recall, which is the same failure mode that uniform attention compression hits. Hybrid SSM-attention models — Jamba, Zamba, the Mamba-2 lineage — are the active research frontier here, and they share a deep similarity with CSA/HCA: heterogeneous layer types, each with different memory/compute profiles.

Moonshot’s Kimi K2.5 is the closest analog to V4 in shape — large open-weight MoE with extended context, trillion-parameter scale, aggressive pricing. The two labs are on similar architectural arcs, and the Chinese open-weight ecosystem appears to be converging on hybrid compressed-sparse attention as the long-context default.

The one-line summary: in 2026, every credible long-context model has either disclosed or strongly implied some form of compressed plus sparse attention. The closed labs aren’t publishing; the open labs are. V4 is the most legible disclosure to date.

What still doesn’t work at 1M context

A measured technical article must also say what V4 does not solve. The 27% number is real and the 1M context is real. The benchmarks within those contexts are a different story.

Long-context evaluation benchmarks — RULER, LongBench v2, CorpusQA — consistently show that advertised context windows and effective reasoning windows diverge as length grows. Models that score well at single-needle retrieval at 1M tokens often score much worse at multi-hop reasoning or distributed evidence synthesis at the same length. The V4 report’s own numbers reflect this: MRCR at 1M scores well but is far from human-level, and CorpusQA at 1M shows the gap between “find this fact” and “reason across this corpus.”

The practical reading: V4 makes 1M context affordable, not solved. The right workloads for 1M-context V4 are single-shot synthesis tasks (legal review of a 500-page contract, code analysis across a 200K-line repo, document QA over a large brief) where a single high-recall pass is enough. Multi-hop reasoning across millions of tokens still degrades, and the right architecture for that is some combination of agent loops, retrieval, and incremental synthesis — not one giant prompt.

There’s also the latency tax. The asymptotic compute story is good; the wall-clock story has constants. Time-to-first-token at 1M context on V4 is measured in seconds, not milliseconds, because even with sparse k, the prefill phase must process every input token through the model. Sparse attention reduces the compute per pair; it doesn’t reduce the linear cost of seeing every token at least once. Plan for it.

Implications for builders

Three concrete moves are worth making if you work on systems that consume long-context inference.

The 1M-context tier is now commodity-priced. Workloads that previously needed retrieval to fit in a smaller context window can, in many cases, just be stuffed into V4’s prompt directly. This is not always the right call — retrieval is still cheaper per query, often more accurate for fact lookup, and easier to debug — but the cost gap has compressed enough that the decision is now per-workload, not architectural. The right question is “does retrieval cost less than just paying for the long context here,” not “do I have to use retrieval to make this feasible.”

Verify recall at the contexts you’ll actually use. Don’t trust the 1M-context marketing without your own needle-in-haystack and multi-hop tests on your data. The model’s recall at 8K is excellent. At 1M, it is workload-specific and benchmark-specific. The honest evaluation budget for an enterprise long-context deployment is 1–2 weeks of red-teaming on representative tasks, not an afternoon of casual prompts.
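A minimal harness for that kind of check might look like the sketch below. The `generate` callable is whatever wraps your model API; the filler text, needle planting, and substring scoring are deliberately simple stand-ins for a real evaluation suite.

```python
import random

def build_haystack(needle: str, filler: str, n_chars: int, seed: int = 0) -> str:
    """Plant `needle` at a random depth inside ~n_chars of filler text."""
    rng = random.Random(seed)
    body = (filler * (n_chars // len(filler) + 1))[:n_chars]
    pos = rng.randrange(len(body))
    return body[:pos] + "\n" + needle + "\n" + body[pos:]

def needle_recall(generate, needle_fact: str, question: str, expected: str,
                  context_sizes=(8_000, 100_000, 1_000_000), trials: int = 3) -> dict:
    """Measure recall at each context size. `generate` is your model call:
    any function mapping a prompt string to a completion string."""
    filler = "The quick brown fox jumps over the lazy dog. "
    results = {}
    for n_chars in context_sizes:
        hits = 0
        for trial in range(trials):
            haystack = build_haystack(needle_fact, filler, n_chars, seed=trial)
            answer = generate(haystack + "\n\nQuestion: " + question)
            hits += expected.lower() in answer.lower()
        results[n_chars] = hits / trials
    return results

# Dummy model that actually reads its prompt, as a smoke test of the harness.
def echo_model(prompt: str) -> str:
    return "The code word is AZURE." if "AZURE" in prompt else "I don't know."

scores = needle_recall(echo_model, "The secret code word is AZURE.",
                       "What is the secret code word?", "AZURE")
print(scores)   # {8000: 1.0, 100000: 1.0, 1000000: 1.0}
```

Single-needle retrieval is the easiest case; extend the same harness with multi-hop questions (two planted facts that must be combined) before trusting a model at your target context length.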

Expect this architecture in every open-weight successor. CSA/HCA is now reproducible — the weights are open, the inference config is published, and the research lineage (MLA, NSA) is fully documented in papers. The next Llama, Qwen, Mistral, and Kimi releases will almost certainly include some flavor of compressed-plus-sparse attention. The “1M-context for cheap” capability is no longer a moat; it is becoming a baseline. Architectural decisions made now in production systems should assume this floor.

What V4 changes about the field

The deeper point is that attention design has become the load-bearing innovation in long-context inference. For a long stretch — roughly 2020 through 2023 — the dominant axis of model improvement was parameter count: scale up, train longer, see what emerges. That story didn’t stop, but it slowed: the marginal benchmark gain from going from 1T to 1.6T total parameters is small. The marginal benchmark gain from making 1M context affordable to serve is large. So the field has moved from “scale parameters” to “scale parameters cheaply and scale context cheaply,” and that second clause is where attention engineering lives.

V4 is not the final word — it is the third visible step in DeepSeek’s attention arc and the first one that ships at production scale with open weights. The hardware side is converging on the same bet: TPU v8i’s 3× SRAM increase, NVIDIA’s GB200 unified-memory architecture, and the broader move toward keeping more of the KV cache in fast memory all read as the silicon-level corollary to what CSA/HCA does at the model level. Whatever V5 looks like, it will inherit the same constraint: bandwidth and memory, not raw FLOPs, are what gate the next order-of-magnitude improvement in long-context serving. The architecture follows.

For frontier-watchers, the takeaway is that the next major long-context paper from any lab — closed or open — should be read against the V4 baseline. “1M context, frontier-class quality, sub-30% of V3.2 compute” is the bar to beat now. It was not the bar three months ago.

Sources

  • DeepSeek-AI, DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. arXiv:2405.04434
  • Yuan et al., Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention. arXiv:2502.11089
  • Vaswani et al., Attention Is All You Need. arXiv:1706.03762
  • Child et al., Generating Long Sequences with Sparse Transformers. arXiv:1904.10509
  • Beltagy et al., Longformer: The Long-Document Transformer. arXiv:2004.05150
  • Zaheer et al., Big Bird: Transformers for Longer Sequences. arXiv:2007.14062
  • Wang et al., Linformer: Self-Attention with Linear Complexity. arXiv:2006.04768
  • Choromanski et al., Rethinking Attention with Performers. arXiv:2009.14794
  • Gu and Dao, Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752
  • Bai et al., LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks. arXiv:2412.15204
  • Simon Willison, DeepSeek V4. simonwillison.net, April 24, 2026
