DeepSeek’s V4 technical report makes a specific, falsifiable claim: at a 1-million-token context, V4-Pro performs a single decode step at 27% of the per-token FLOPs of V3.2, and uses 10% of the KV cache. V4-Flash, the smaller sibling, lands at 10% of the FLOPs and 7% of the cache. The headline number that floated through press coverage — “frontier-class at a tenth the cost” — is, almost entirely, this number. It is not pretraining efficiency, not chip-level wizardry, not a new MoE routing trick. It is what the attention layer is doing differently.
This article unpacks that “differently.” The mechanism has two parts, which DeepSeek calls Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). Press coverage and the V4 preview page sometimes refer to the package as “token-wise compression plus DSA”; the technical report breaks it cleanly into the two named components, and we’ll use the report’s terminology throughout. The lineage runs MLA → DSA → CSA/HCA across four DeepSeek releases, and the story of V4 is not one clever idea but the third pass at a problem the lab has been compounding on since 2024.
We assume the reader knows what attention is. If you want the prerequisites, foundations of transformer reasoning covers the standard mechanism and the KV cache article covers the memory side of long-context inference. CSA/HCA is what you do once both of those constraints have been named.
Two costs, not one
Long-context inference has two enemies, and most prior “efficient attention” work attacks only one.
The first is compute. Standard attention is
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
The Q K^T term is the expensive one. For a sequence of length n, it produces an n × n similarity matrix — O(n^2 · d) work at the layer level. At 8K context, that's manageable. At 1M, it is the dominant term in the entire forward pass.
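To make the quadratic term concrete, here is a minimal NumPy sketch of dense single-head attention; the sizes (n, d) are invented for illustration, not V3.2's configuration. The score matrix alone is n × n, which is where the cost lives.

```python
import numpy as np

def dense_attention(Q, K, V):
    """Naive single-head attention; the two matmuls are ~2 * n^2 * d multiply-adds each."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)                    # n x n similarity matrix
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

n, d = 4096, 128                                       # illustrative sizes only
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = dense_attention(Q, K, V)

flops = 2 * 2 * n * n * d                              # two n x n x d matmuls
print(f"~{flops / 1e9:.1f} GFLOPs at n={n}; going to n=1M multiplies this by ~60,000")
```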
The second is memory. For autoregressive decoding, every prior token's K and V vectors must stay in GPU memory so the next token can attend back. That cost grows linearly with n per request per layer, and on a 70B-class model it dominates GPU memory well before n approaches 1M. The KV cache article walks the math.
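A back-of-envelope calculation with assumed dimensions for a generic 70B-class model (not any specific model's published config) shows why the cache, not compute, is often the first wall:

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, d_head, bytes_per_elem=2):
    """Per-request KV cache: K and V vectors for every token, layer, and KV head (fp16)."""
    return 2 * n_tokens * n_layers * n_kv_heads * d_head * bytes_per_elem

# Hypothetical 70B-class configuration, for illustration only.
cfg = dict(n_layers=80, n_kv_heads=8, d_head=128)
for n in (8_192, 128_000, 1_000_000):
    print(f"{n:>9,} tokens -> {kv_cache_bytes(n, **cfg) / 1e9:6.1f} GB per request")
```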
Sparse-attention research from 2019–2021 — Sparse Transformer, Longformer, BigBird — attacked the compute side by making each query attend to only a fixed subset of keys, with patterns like local windows plus a few global tokens. Linear-attention work — Linformer, Performer — attacked it by approximating softmax with low-rank or kernel-based reductions. Neither line reduced the cache; both struggled to match dense attention on long-range reasoning benchmarks once you trained at scale.
DeepSeek’s contribution across V2, V3.2, and V4 is that the lab has, in three steps, attacked both costs in a way that survives full-scale pretraining and frontier-competitive downstream evaluation. CSA/HCA is the third step.
The lineage: MLA → DSA → CSA/HCA
The most common framing error in V4 coverage is to describe it as “MLA but better.” MLA is the V2/V3 mechanism. V4 inherits MLA’s idea — compress K and V into a low-rank latent space — and then layers two new mechanisms on top. Between V3 and V4 sits V3.2, which introduces the sparsity half of the package. Skipping that step makes the rest of the story incoherent.
MLA in V2/V3
Multi-head Latent Attention, introduced in DeepSeek-V2, projects the K and V tensors into a shared low-rank latent of dimension d_c (roughly 512 in V2), caches that latent, and reconstructs per-head K and V at attention time from the cached latent plus a small per-head matrix. The V2 paper reports a 93.3% KV-cache reduction and 5.76× higher generation throughput against its baseline.
The win is purely on cache size. Every cached token still participates in every attention computation — the cache is just smaller per token. Compute per query is unchanged. This is a compression technique, not a sparsity technique.
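A minimal sketch of the MLA idea, with invented dimensions and without V2's decoupled RoPE path: only the small latent is cached, and per-head K and V are reconstructed from it at attention time.

```python
import numpy as np

d_model, n_heads, d_head, d_latent = 4096, 32, 128, 512     # illustrative sizes only
rng = np.random.default_rng(0)
W_down = rng.standard_normal((d_model, d_latent)) * 0.02          # shared compression
W_up_k = rng.standard_normal((n_heads, d_latent, d_head)) * 0.02  # per-head expansion
W_up_v = rng.standard_normal((n_heads, d_latent, d_head)) * 0.02

def compress(hidden):
    """What actually gets cached per token: one d_latent vector, not per-head K and V."""
    return hidden @ W_down                                  # (tokens, d_latent)

def expand_kv(latents):
    """Reconstruct per-head K and V from the cached latents at attention time."""
    K = np.einsum('td,hdf->htf', latents, W_up_k)           # (heads, tokens, d_head)
    V = np.einsum('td,hdf->htf', latents, W_up_v)
    return K, V

hidden = rng.standard_normal((1024, d_model))
latents = compress(hidden)
K, V = expand_kv(latents)

full_kv_elems = 1024 * n_heads * d_head * 2                 # dense per-token K + V
print(f"cache is {latents.size / full_kv_elems:.1%} the size of full K/V")   # ~6%
```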
DSA in V3.2
V3.2-Exp introduced DeepSeek Sparse Attention (DSA). This is the step that gets skipped in most press coverage — the missing rung in the lineage. DSA adds selection to MLA: rather than attending over the entire compressed-latent cache, each query attends only to a learned subset of past tokens. The set is chosen per query at inference time — not by fixed pattern, but by a small learned router that decides which prior positions are relevant to this position.
This is the conceptually expensive move. Learned sparsity delivers what fixed-pattern sparse attention (Longformer, BigBird) could not: it does not commit to a topology in advance, and the model gets to spend its sparsity budget on whichever tokens matter for this input. The historical lesson from sparse-attention research is that fixed patterns lose information predictably — long-range dependencies fall outside the pattern, and quality degrades. DSA's per-query selection avoids that failure mode at the cost of needing a learned selector that itself runs in sub-quadratic time.
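The selection step can be sketched as a cheap learned scorer followed by a top-k gather; this is a schematic of the idea under invented dimensions, not DSA's actual selector or scoring function.

```python
import numpy as np

def select_and_attend(q, latent_cache, k_budget, W_score, W_k, W_v):
    """Per-query sparse attention over a compressed cache (schematic only).

    Step 1: a cheap scorer ranks every cached latent against this query, O(n * d_latent).
    Step 2: only the top k_budget entries are expanded into K/V and attended over.
    """
    scores = latent_cache @ (W_score @ q)                   # (n,) relevance scores
    top = np.argpartition(scores, -k_budget)[-k_budget:]    # indices of selected tokens

    K_sel = latent_cache[top] @ W_k                         # (k, d_head)
    V_sel = latent_cache[top] @ W_v
    logits = K_sel @ q / np.sqrt(q.shape[0])
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ V_sel                                        # (d_head,)

n, d_latent, d_head, k_budget = 100_000, 512, 128, 2048     # illustrative sizes only
rng = np.random.default_rng(1)
cache = rng.standard_normal((n, d_latent)).astype(np.float32)
q = rng.standard_normal(d_head).astype(np.float32)
W_score, W_k, W_v = (rng.standard_normal((d_latent, d_head)).astype(np.float32) * 0.02
                     for _ in range(3))
out = select_and_attend(q, cache, k_budget, W_score, W_k, W_v)
```

The top-k gather here is exactly the scatter-gather that kills GPU memory bandwidth when done naively — which is the problem the NSA kernels, discussed next, were built to solve.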
The closest research precursor is Native Sparse Attention (NSA), published in early 2025 by authors overlapping with DeepSeek’s V3.2 team. NSA’s specific contribution was hardware-aligned sparsity: kernels that read structured-sparse slices of the cache as fast as dense kernels read contiguous memory. Without that, learned sparsity is theoretically efficient and practically slower than the dense baseline, because random scatter-gather kills GPU memory bandwidth. NSA is the bridge from “this works on paper” to “this runs on H800s at training scale.”
CSA and HCA in V4
V4 productionizes the lineage. The attention stack is heterogeneous: some layers run Compressed Sparse Attention (CSA), others run Heavily Compressed Attention (HCA).
CSA is essentially DSA with the compression more deeply integrated: token-wise compression of K and V into a latent, learned per-query selection of which compressed entries to attend over, and hardware-aligned kernels inherited from the NSA line. Each CSA query reads a small structured-sparse slice of the cache that is both shrunk (compression) and short (sparsity).
HCA is the more aggressive variant. Where CSA reduces compute and memory along both axes, HCA leans hard on compression — the per-token latent is squeezed further, with correspondingly larger reconstruction overhead per attended token. The trade-off is that HCA layers can afford to attend over a denser slice of the past because each entry is so small. CSA layers are sparse but moderately compressed; HCA layers are denser but heavily compressed. Both end up cheaper than dense MLA, and they make different trade-offs in the compute/memory plane.
The hybrid scheduling — which layer runs which — is the part of V4 that does not generalize without the paper’s specific recipe. The published inference config suggests early layers lean denser (the model is still building its representation) and deeper layers lean sparser (most retrieval is local once enough context has been built), but treating this as architectural law would overclaim. Read the V4 report for the exact policy.
A clean way to hold this in your head: MLA shrunk the cache. DSA added selection. CSA combined them. HCA pushed compression further so some layers could afford to be less sparse. Each step inherits and refines the previous.
Unpacking the 27% claim
The “27% of V3.2 FLOPs at 1M context” number is doing a lot of work in the V4 narrative, and it’s worth being precise about where the 73% savings come from.
Attention compute scales as roughly k · d per query, where k is the effective set size attended over and d is the effective per-entry dimensionality. For dense full attention, k = n and d is the full per-token footprint (d_head times the number of KV heads). For DeepSeek's stack:
| Layer type | Effective k | Effective d | Compute term |
|---|---|---|---|
| Full MHA (hypothetical baseline) | n | n_heads · d_head | ∝ n · n_heads · d_head |
| MLA (V2/V3) | n | unchanged at compute time (cache holds the low-rank latent) | ∝ n · n_heads · d_head |
| CSA (V4) | k ≪ n (learned) | d_c (compressed) | ∝ k · d_c |
| HCA (V4) | ≈ n (denser) | d_hc ≪ d_c (heavily compressed) | ∝ n · d_hc |
CSA cuts the dominant term to k · d_c by making each query attend to only k selected positions instead of all n. HCA keeps the factor of n but slashes the per-entry constant. The interleaving means the model only pays the full-n cost on the heavily-compressed layers — and pays a much smaller k · d_c cost everywhere else.
This is the structural reason 27% (and not, say, 50% or 5%) is plausible. At 1M context, a sparse layer costs roughly k/n times the dense cost — if k/n ≈ 0.05, those layers cost about 5% of dense. The HCA layers, attending nearly full-n but at much lower d, cost perhaps 20–30% of dense. Weight them across all 61 layers and you land somewhere in the 25–30% band. The exact decomposition depends on the layer schedule, which the V4 paper specifies; we're walking the order-of-magnitude story, not reproducing the report.
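To see how a mix of layer types lands in that band, here is a toy weighting; the layer splits and per-layer cost fractions are invented to mirror the rough numbers above, not the published V4 schedule.

```python
def relative_attention_cost(n_csa, n_hca, csa_cost=0.05, hca_cost=0.30):
    """Per-token attention cost relative to an all-dense stack at the same context.

    csa_cost: cost of a CSA layer as a fraction of dense (roughly k/n).
    hca_cost: cost of an HCA layer as a fraction of dense (reduced per-entry dim).
    """
    return (n_csa * csa_cost + n_hca * hca_cost) / (n_csa + n_hca)

# Hypothetical splits of 61 layers -- NOT the schedule from the V4 report.
for n_csa, n_hca in [(10, 51), (20, 41), (40, 21)]:
    print(f"{n_csa:>2} CSA + {n_hca:>2} HCA -> {relative_attention_cost(n_csa, n_hca):.0%} of dense")
```

The first split lands near the 25–30% band; the others show how strongly the total depends on the schedule.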
One more important asymmetry: the 27% number is at 1M context. At shorter contexts, the attention term doesn't dominate the forward pass, so the savings are smaller. The V4 report's framing is honest about this — the gain compounds with sequence length, which is precisely the regime DeepSeek is pricing into. At 8K context, V4 is not 4× cheaper than V3.2 per token; it is roughly comparable. Long context is where this architecture pays off.
The KV cache side mirrors the compute side. V4-Pro’s 10% cache vs. V3.2 at 1M is the combined effect of (a) tighter latent compression in HCA layers, (b) reduced per-token storage in CSA layers because the compressed representation is smaller, and (c) the absence of any cache duplication across heads (already inherited from MLA). V4-Flash’s 7% is the same recipe with more aggressive layer-level compression.
Why this is hard to do without quality loss
The history of efficient attention is largely a history of papers that beat dense attention on a benchmark in a specific regime and then failed to scale. Linformer, Performer, Reformer, sliding-window attention, Longformer, BigBird — all are useful, none replaced dense attention at frontier scale, and the reason is the same in each case: uniform information loss. A fixed sparsity pattern throws away the same kinds of relationships every time. A low-rank approximation throws away the same kinds of features every time. The model can’t route around the deficit because the deficit is baked into the architecture.
V4’s design responds to this in three ways.
Sparsity is per-input, not architectural. CSA’s selector is a learned, query-conditioned router. If the next token depends on a passage 800K tokens earlier, the selector can route attention there for that query specifically — without having paid quadratic cost to find it, because the selector itself runs sub-quadratically. This is the conceptual generalization of fixed sparsity that earlier work could not implement at scale, and the gap between “should work in principle” and “actually runs at frontier scale” is where most of NSA’s engineering went.
Compression keeps a slow lane. HCA layers are aggressive; CSA layers are less so. Information that would be lost by uniform heavy compression survives in the intermediate-compression layers, where the model can re-expand into richer representations as needed. The architecture has a fast path and a slow path, in the same sense that an out-of-order CPU has bypass networks for hot data.
Training is long-context from the start. Earlier sparse-attention work was often retrofitted onto models pretrained on short contexts, where the sparsity pattern is unstressed and the selector never has to do hard work. V4 trains with long-context curricula throughout, so the selector and the compression both learn under the regime they’ll be deployed in. This sounds obvious in 2026 and was rare in 2020.
None of these is novel in isolation. The novelty is that all three hold together at 1.6T parameters and 1M context — and that, as the V4 report and downstream benchmarks show, the resulting model lands within a few points of GPT-5.5 and Claude Opus 4.7 on most reasoning tasks. The benchmark wins are not universal. V4 is below GPT-5.5 on SWE-Bench Verified (~80.6% vs 82.7%) and below Claude Opus 4.7 on some long-context retrieval tasks. The story is parity-at-fraction-of-cost, not new state of the art. See the V4 cost-disruption note for the full benchmark picture.
The competitive context
Other long-context models in 2026 are not standing still, and it’s worth grounding CSA/HCA against what else is in the field.
Gemini 3 Pro offers 1M context with an undisclosed attention mechanism. Google has not published architecture details for Gemini’s long-context path; the published pricing — $2/$12 per 1M tokens up to 200K, $4/$18 above — suggests internal cost reflects a meaningful long-context surcharge, which is consistent with some form of sparse or compressed attention but is not direct evidence of any specific mechanism.
GPT-5.5 and Claude Opus 4.7 ship shorter context windows (200K–400K) with attention that is, by all available reporting, denser than DeepSeek’s. The prioritization is the opposite: keep attention rich for the quality ceiling, accept that the context window can’t stretch to 1M without prohibitive cost. Anthropic’s extended-context guidance and OpenAI’s tool-augmented workflows both push users toward retrieval and tool use for very large corpora rather than dumping everything into prompt.
Mamba and state-space models sidestep attention entirely, achieving constant-cost per-token inference with a constant-size compressed state. Quality has trailed transformer attention on benchmarks requiring precise global recall, which is the same failure mode that uniform attention compression hits. Hybrid SSM-attention models — Jamba, Zamba, the Mamba-2 lineage — are the active research frontier here, and they share a deep similarity with CSA/HCA: heterogeneous layer types, each with different memory/compute profiles.
Moonshot’s Kimi K2.5 is the closest analog to V4 in shape — large open-weight MoE with extended context, trillion-parameter scale, aggressive pricing. The two labs are on similar architectural arcs, and the Chinese open-weight ecosystem appears to be converging on hybrid compressed-sparse attention as the long-context default.
The one-line summary: in 2026, every credible long-context model has either disclosed or strongly implied some form of compressed plus sparse attention. The closed labs aren’t publishing; the open labs are. V4 is the most legible disclosure to date.
What still doesn’t work at 1M context
A measured technical article must also say what V4 does not solve. The 27% number is real and the 1M context is real. The benchmarks within those contexts are a different story.
Long-context evaluation benchmarks — RULER, LongBench v2, CorpusQA — consistently show that advertised context windows and effective reasoning windows diverge as length grows. Models that score well at single-needle retrieval at 1M tokens often score much worse at multi-hop reasoning or distributed evidence synthesis at the same length. The V4 report’s own numbers reflect this: MRCR at 1M scores well but is far from human-level, and CorpusQA at 1M shows the gap between “find this fact” and “reason across this corpus.”
The practical reading: V4 makes 1M context affordable, not solved. The right workloads for 1M-context V4 are single-shot synthesis tasks (legal review of a 500-page contract, code analysis across a 200K-line repo, document QA over a large brief) where a single high-recall pass is enough. Multi-hop reasoning across millions of tokens still degrades, and the right architecture for that is some combination of agent loops, retrieval, and incremental synthesis — not one giant prompt.
There’s also the latency tax. The asymptotic compute story is good; the wall-clock story has constants. Time-to-first-token at 1M context on V4 is measured in seconds, not milliseconds, because even with sparse attention, the prefill phase must process every input token through the model. Sparse attention reduces the compute per query–key pair; it doesn’t reduce the linear cost of seeing every token at least once. Plan for it.
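A rough prefill-latency model makes the point; the active-parameter count and cluster throughput below are placeholders, not measurements of V4 on any particular hardware.

```python
def prefill_seconds(n_tokens, flops_per_token, cluster_flops_per_s, utilization=0.4):
    """Lower bound on time to first token: every prompt token passes through the model once."""
    return n_tokens * flops_per_token / (cluster_flops_per_s * utilization)

# Placeholder numbers for illustration only.
n_prompt = 1_000_000                # 1M-token prompt
flops_per_token = 2 * 40e9          # ~2 FLOPs per active parameter, assuming ~40B active (hypothetical)
cluster = 8 * 1e15                  # eight accelerators at ~1 PFLOP/s dense each (hypothetical)
print(f"~{prefill_seconds(n_prompt, flops_per_token, cluster):.0f} s to first token")
```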
Implications for builders
Three concrete moves are worth making if you work on systems that consume long-context inference.
The 1M-context tier is now commodity-priced. Workloads that previously needed retrieval to fit in a smaller context window can, in many cases, just be stuffed into V4’s prompt directly. This is not always the right call — retrieval is still cheaper per query, often more accurate for fact lookup, and easier to debug — but the cost gap has compressed enough that the decision is now per-workload, not architectural. The right question is “does retrieval cost less than just paying for the long context here,” not “do I have to use retrieval to make this feasible.”
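The per-workload decision is just arithmetic once you plug in your own rate card; the prices below are placeholders, not DeepSeek's published pricing.

```python
def per_query_cost(prompt_tokens, output_tokens, price_in_per_m, price_out_per_m):
    """Dollar cost of a single query at given per-million-token prices."""
    return prompt_tokens / 1e6 * price_in_per_m + output_tokens / 1e6 * price_out_per_m

# Placeholder prices -- substitute the current rate card before deciding anything.
full_ctx = per_query_cost(1_000_000, 2_000, price_in_per_m=0.30, price_out_per_m=1.20)
retrieval = per_query_cost(8_000, 2_000, price_in_per_m=0.30, price_out_per_m=1.20)
print(f"full 1M-token prompt: ${full_ctx:.3f}/query   retrieval into 8K: ${retrieval:.4f}/query")
```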
Verify recall at the contexts you’ll actually use. Don’t trust the 1M-context marketing without your own needle-in-haystack and multi-hop tests on your data. The model’s recall at 8K is excellent. At 1M, it is workload-specific and benchmark-specific. The honest evaluation budget for an enterprise long-context deployment is 1–2 weeks of red-teaming on representative tasks, not an afternoon of casual prompts.
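A minimal needle-in-haystack harness looks like the sketch below; the endpoint, model name, and environment variable are assumptions (any OpenAI-compatible chat API would work), and a real evaluation should add multi-hop questions over your own documents, not just planted facts.

```python
import os
import requests

API_URL = "https://api.example.com/v1/chat/completions"   # placeholder endpoint
MODEL = "deepseek-v4"                                      # placeholder model name

def needle_test(filler_docs, needle, question, expected, depth=0.5):
    """Plant a fact at a relative depth in a long prompt and check whether it is recalled."""
    docs = list(filler_docs)
    docs.insert(int(len(docs) * depth), needle)
    prompt = "\n\n".join(docs) + f"\n\nQuestion: {question}\nAnswer concisely."
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {os.environ['API_KEY']}"},
        json={"model": MODEL, "messages": [{"role": "user", "content": prompt}]},
        timeout=600,
    )
    answer = resp.json()["choices"][0]["message"]["content"]
    return expected.lower() in answer.lower()

# Sweep depth across the window you actually plan to use, not just the maximum.
filler = ["(your real corpus chunks here)"] * 1000
hits = [needle_test(filler, "The vault code is 7141.", "What is the vault code?", "7141", d)
        for d in (0.05, 0.25, 0.5, 0.75, 0.95)]
print(f"recall: {sum(hits)}/{len(hits)} depths")
```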
Expect this architecture in every open-weight successor. CSA/HCA is now reproducible — the weights are open, the inference config is published, and the research lineage (MLA, NSA) is fully documented in papers. The next Llama, Qwen, Mistral, and Kimi releases will almost certainly include some flavor of compressed-plus-sparse attention. The “1M-context for cheap” capability is no longer a moat; it is becoming a baseline. Architectural decisions made now in production systems should assume this floor.
What V4 changes about the field
The deeper point is that attention design has become the load-bearing innovation in long-context inference. For a long stretch — roughly 2020 through 2023 — the dominant axis of model improvement was parameter count: scale up, train longer, see what emerges. That story didn’t stop, but it slowed: the marginal benchmark gain from going from 1T to 1.6T total parameters is small. The marginal benchmark gain from making 1M context affordable to serve is large. So the field has moved from “scale parameters” to “scale parameters cheaply and scale context cheaply,” and that second clause is where attention engineering lives.
V4 is not the final word — it is the third visible step in DeepSeek’s attention arc and the first one that ships at production scale with open weights. The hardware side is converging on the same bet: TPU v8i’s 3× SRAM increase, NVIDIA’s GB200 unified-memory architecture, and the broader move toward keeping more of the KV cache in fast memory all read as the silicon-level corollary to what CSA/HCA does at the model level. Whatever V5 looks like, it will inherit the same constraint: bandwidth and memory, not raw FLOPs, are what gate the next order-of-magnitude improvement in long-context serving. The architecture follows.
For frontier-watchers, the takeaway is that the next major long-context paper from any lab — closed or open — should be read against the V4 baseline. “1M context, frontier-class quality, sub-30% of V3.2 compute” is the bar to beat now. It was not the bar three months ago.
Sources
- DeepSeek-AI, DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model — arXiv:2405.04434
- Yuan et al., Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention — arXiv:2502.11089
- Vaswani et al., Attention Is All You Need — arXiv:1706.03762
- Child et al., Generating Long Sequences with Sparse Transformers — arXiv:1904.10509
- Beltagy et al., Longformer: The Long-Document Transformer — arXiv:2004.05150
- Zaheer et al., Big Bird: Transformers for Longer Sequences — arXiv:2007.14062
- Wang et al., Linformer: Self-Attention with Linear Complexity — arXiv:2006.04768
- Choromanski et al., Rethinking Attention with Performers — arXiv:2009.14794
- Gu and Dao, Mamba: Linear-Time Sequence Modeling with Selective State Spaces — arXiv:2312.00752
- Bai et al., LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks — arXiv:2412.15204
- Simon Willison, DeepSeek V4 — simonwillison.net, April 24 2026
