DeepSeek V4: A 1.6-Trillion-Parameter Open Model at 1/7 the Cost of GPT-5.5

DeepSeek V4 Pro ships open weights, a 1M context window, and pricing 6-7x cheaper than Western frontier models. It is not better than GPT-5.5 or Claude Opus 4.7 — but it is close, and that is the point.

S5 Labs Team · April 24, 2026

DeepSeek released V4 in preview on April 24, with two variants — V4-Pro at 1.6 trillion total parameters (49B active), and V4-Flash at 284 billion total parameters (13B active). Both are open-weight under MIT license, both support a 1M-token context, and both are priced at a fraction of what OpenAI and Anthropic charge for comparable capability. V4-Pro lands at $0.145 per million input tokens and $1.74 per million output tokens. GPT-5.5 lists at $5.00 and $30.00. That is roughly 35x and 17x more expensive, respectively.
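Those multiples are simple arithmetic on the list prices. A quick back-of-envelope check, a sketch rather than a billing tool:

```python
# Back-of-envelope price comparison, per million tokens (USD, list prices above).
v4_pro = {"input": 0.145, "output": 1.74}
gpt_55 = {"input": 5.00, "output": 30.00}

for direction in ("input", "output"):
    multiple = gpt_55[direction] / v4_pro[direction]
    print(f"{direction}: GPT-5.5 is {multiple:.1f}x the price of V4-Pro")
# input: GPT-5.5 is 34.5x the price of V4-Pro
# output: GPT-5.5 is 17.2x the price of V4-Pro
```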

The model is not the frontier — it sits below GPT-5.5 and Claude Opus 4.7 on most benchmarks. But it is close enough that the cost gap, not the capability gap, is the story. The implication is the same one DeepSeek's V3 release forced last year, only sharper: the gap between the U.S. frontier and a credible Chinese alternative is now measured in single benchmark points, and the cost advantage is measured in multiples.

The Architecture

DeepSeek V4-Pro uses a hybrid attention architecture combining what the paper calls Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). At a 1M-token context, this drops single-token inference FLOPs to 27% of DeepSeek V3.2’s — a 3.7x compute reduction at the same context length. That is the math behind the pricing.
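The headline reduction is just the reciprocal of that ratio; a one-line sanity check:

```python
# The 3.7x figure is the reciprocal of the quoted FLOP ratio.
v4_fraction_of_v32 = 0.27  # single-token inference FLOPs vs. V3.2 at 1M context
print(f"compute reduction: {1 / v4_fraction_of_v32:.1f}x")  # -> 3.7x
```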

Active parameter counts are the other half of the story:

  • V4-Pro: 1.6T total, 49B active per token (MoE with ~3% activation rate)
  • V4-Flash: 284B total, 13B active per token

These are similar shapes to MiniMax M2.5 and Kimi K2.5 — large sparse mixtures with small active footprints. The pattern across the Chinese open-weight wave is consistent: total parameters scale to capability, active parameters scale to inference cost, and the gap between the two is where the economics happen.
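A common rule of thumb is that a forward pass costs roughly 2 FLOPs per parameter per token; for a MoE, only the active slice counts. A minimal sketch under that assumption (it ignores attention cost, which is precisely what CSA and HCA compress):

```python
# Rough per-token forward-pass compute for a MoE model: ~2 FLOPs per active
# parameter per token. Ignores attention cost, so treat these as floors.

def forward_flops_per_token(active_params: float) -> float:
    return 2.0 * active_params

models = {
    "V4-Pro (49B active of 1.6T)": 49e9,
    "V4-Flash (13B active of 284B)": 13e9,
}
for name, active in models.items():
    gflops = forward_flops_per_token(active) / 1e9
    print(f"{name}: ~{gflops:.0f} GFLOPs/token")
# Total parameters set memory footprint and capability; only the active slice
# shows up in per-token compute, which is where the pricing gap comes from.
```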

Where It Sits on Benchmarks

The benchmark picture is roughly:

  • SWE-Bench Verified: 80.6% — a hair behind GPT-5.5’s 82.7%, ahead of Claude Sonnet 4.6
  • Terminal-Bench 2.0: high 70s — competitive with Gemini 3 Pro, behind GPT-5.5
  • Math and reasoning (AIME, MATH-500): within 2-3 points of the frontier

“Within 2-3 points” is the headline finding. A year ago, that gap was 10-15 points. The Chinese open-weight ecosystem has not just closed the distance; on most tasks that enterprises actually run, it has effectively merged with the frontier.

The Pricing Disruption

V4-Flash is the line item that should make every model-buying enterprise stop:

  • Input: $0.14 per million tokens
  • Output: $0.28 per million tokens

For comparison, GPT-5.4 Mini is $0.50/$2.00 and Claude Sonnet 4.6 is $1.50/$7.50 (input/output, per million tokens). V4-Flash undercuts both by a meaningful multiple while providing comparable performance on the workloads that drive volume — classification, extraction, lightweight reasoning, agent subtasks.

The shape of this pricing matters more than the absolute numbers. If you are running a subagent architecture — one frontier model orchestrating dozens of cheaper models — V4-Flash is now the cheapest credible subagent on the market. That is the architecture pattern GPT-5.4 Mini and Nano were built for, and DeepSeek just stapled a 70% discount onto the bottom of that stack.
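To make the stack-shape point concrete, here is an illustrative per-run cost for an orchestrator-plus-fan-out pattern. The token and call counts are invented for illustration; only the per-million prices come from above:

```python
# Illustrative cost of one agent run: a frontier orchestrator plus fan-out
# subagents. Token and call counts are invented; prices are the list prices.

def call_cost(in_tok: int, out_tok: int, price_in: float, price_out: float) -> float:
    return (in_tok * price_in + out_tok * price_out) / 1e6

# One orchestrator turn on GPT-5.5 ($5.00 in / $30.00 out per Mtok).
orchestrator = call_cost(20_000, 2_000, 5.00, 30.00)

# Thirty subagent calls at ~8K in / 1K out each, on two candidate tiers.
n = 30
mini = n * call_cost(8_000, 1_000, 0.50, 2.00)    # GPT-5.4 Mini
flash = n * call_cost(8_000, 1_000, 0.14, 0.28)   # V4-Flash

print(f"orchestrator (GPT-5.5): ${orchestrator:.3f}")  # $0.160
print(f"subagents on Mini:      ${mini:.3f}")          # $0.180
print(f"subagents on Flash:     ${flash:.3f}")         # $0.042
```

Under these made-up numbers, the subagent line goes from rivaling the orchestrator bill to a rounding error, which is the whole point of the stack shape.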

What “Open Weights” Means in Practice

Both V4 variants are released under the MIT license. That is full commercial use, no field-of-use restrictions, and no attribution requirements — meaningfully more permissive than Llama's community license or Gemma's terms of use with their prohibited-use policy.

For enterprises that need to run inference on their own infrastructure — for compliance, sovereignty, or sheer cost reasons — V4 is the most capable open model on the market by a clear margin. The 1.6T parameter count means it will not run on a single GPU, but on a properly configured cluster (or NVIDIA’s GB200 NVL72), self-hosted V4-Pro is cheaper per token than calling any API at scale.
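Whether self-hosting actually beats the API comes down to utilization. A break-even sketch, where the cluster rate is an assumed placeholder rather than a quoted GB200 NVL72 price:

```python
# Break-even sketch: self-hosting vs. the V4-Pro API output price.
# The cluster rate is an assumed placeholder, not a quoted GB200 NVL72 price.

CLUSTER_COST_PER_HOUR = 250.0   # assumed all-in $/hr for a capable cluster
API_OUT_PER_MTOK = 1.74         # V4-Pro API output price, $/Mtok

# Sustained aggregate throughput needed for self-hosting to match the API:
breakeven_tok_per_sec = CLUSTER_COST_PER_HOUR / API_OUT_PER_MTOK * 1e6 / 3600
print(f"break-even: ~{breakeven_tok_per_sec:,.0f} output tokens/sec")
# ~40K tok/s sustained at these assumptions: self-hosting wins only with
# high batch utilization, which is exactly the "at scale" qualifier above.
```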

This matters for the same reason Qwen3.5 mattered: the cost floor on capable inference is now controlled by Chinese open-weight labs, not by the closed frontier. Western labs can price above that floor for the capability gap and the trust premium, but they cannot price independently of it anymore.

The Geopolitics

The Stanford AI Index reported earlier this month that the U.S.-China frontier gap is now 2.7% on top-line benchmarks — a finding that was already historic when it landed. DeepSeek V4 is the next data point on that curve, and it sharpens the policy question: export controls on H100s and Blackwell did not prevent DeepSeek from training a 1.6T model. They made it harder, more expensive, and slower, but they did not prevent it.

The 27% inference-FLOP figure also speaks to where the Chinese ecosystem is innovating: less on raw chip count, more on architectural efficiency. CSA and HCA are not products of a brute-force GPU cluster — they are products of a research culture optimizing for what the available silicon can do.

What to Do With This

For builders, V4 is now the rational default for anything price-sensitive and quality-tolerant — bulk extraction, classification, mid-tier agent work, internal tooling. Anywhere you would have called GPT-4o-mini or Claude Haiku, V4-Flash is cheaper and at least as good. Keep the frontier models for the workloads that actually need them, which on most product surfaces is a smaller fraction than the model bill suggests.

For platform decisions, the harder question is single-vendor versus multi-vendor model routing. If your stack already runs through OpenRouter or a Bedrock-style abstraction, V4 is a line in your routing config. If it runs through a single proprietary SDK, the cost of switching to V4 may eat the savings — at least until your inference bill grows enough to justify the engineering work.
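If the stack already speaks an OpenAI-compatible gateway, adding V4 really is a line in the config. A minimal routing sketch, assuming an OpenRouter-style endpoint; the model slugs are hypothetical placeholders, not confirmed catalog IDs:

```python
# Minimal routing sketch through an OpenAI-compatible gateway (OpenRouter-style).
# The model slugs below are hypothetical placeholders, not confirmed catalog IDs.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

TIERS = {
    "bulk": "deepseek/deepseek-v4-flash",   # hypothetical slug
    "agent": "deepseek/deepseek-v4-pro",    # hypothetical slug
    "frontier": "openai/gpt-5.5",           # hypothetical slug
}

def route(tier: str) -> str:
    """Pick a model by workload tier: bulk extraction vs. frontier reasoning."""
    return TIERS[tier]

resp = client.chat.completions.create(
    model=route("bulk"),
    messages=[{"role": "user", "content": "Classify this ticket: 'refund not received'"}],
)
print(resp.choices[0].message.content)
```

When switching is this cheap, the vendor decision degrades to a per-tier price comparison; when it requires an SDK migration, the savings have to clear that engineering cost first.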

What Comes Next

Three things to watch in the V4 wake:

  1. A dense V4 distillation — DeepSeek’s pattern is to follow a large MoE release with dense distillations. A 70B-class dense model would be the one most labs and startups actually deploy.
  2. OpenAI’s response — GPT-5.5 Mini and Nano are the obvious move. Whether they undercut V4-Flash on price is the open question.
  3. Anthropic’s silence — Anthropic has not shipped a small-model tier comparable to GPT-Mini or V4-Flash. The cost-floor pressure will eventually force the issue.

The frontier is still American. The cost floor is now Chinese. Both of those statements look durable, and the next 12 months of pricing decisions across the industry will be a response to that fact.
