MiniMax, a Chinese AI startup, released M2.5 today — a model that scores 80.2% on SWE-bench Verified while priced at $1.10 per million output tokens. That’s within 0.6 percentage points of Claude Opus 4.6’s 80.8% on the same benchmark, at roughly one-tenth to one-twentieth the cost.
If those numbers hold up under independent evaluation, M2.5 represents one of the most aggressive price-performance moves in the current AI landscape.
Architecture
M2.5 uses a mixture-of-experts (MoE) architecture with 230 billion total parameters and 10 billion active per forward pass. This high sparsity ratio — only 4.3% of parameters active per token — is what enables the dramatically lower pricing. Fewer active parameters means less compute per inference, which translates directly to lower costs.
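The sparsity arithmetic above can be sketched directly. This is a back-of-envelope illustration using the figures from the article; the assumption that per-token compute scales linearly with active parameters is a simplification of real MoE inference costs.

```python
# Parameter counts reported for M2.5 (from the article).
total_params = 230e9   # 230B total parameters
active_params = 10e9   # 10B active per forward pass

# Fraction of the model engaged per token.
sparsity = active_params / total_params
print(f"Active fraction: {sparsity:.1%}")  # → 4.3%

# Under the simplifying assumption that per-token compute scales with
# active parameters, a dense 230B model would need ~23x the FLOPs.
dense_ratio = total_params / active_params
print(f"Compute ratio vs. a dense 230B model: {dense_ratio:.0f}x")  # → 23x
```

That roughly 23x compute gap is the mechanical basis for the pricing discussed below, before accounting for serving overhead, batching, and margins.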
The model was trained on over 10 programming languages across more than 200,000 real-world development environments. MiniMax also released M2.5 Lightning, an optimized variant tuned for speed-critical applications.
Benchmark Performance
| Benchmark | M2.5 | Claude Opus 4.6 | Context |
|---|---|---|---|
| SWE-bench Verified | 80.2% | 80.8% | Real-world software engineering |
| Multi-SWE-Bench | 51.3 | 50.3 | Multi-repository engineering |
| BFCL Multi-Turn | 76.8 | 68.0* | Tool calling capability |
*BFCL comparison is against Claude 4.5, not 4.6.
The Multi-SWE-Bench score is particularly interesting — M2.5 leads the field at 51.3, above Claude Opus 4.6’s 50.3. Multi-SWE-Bench tests a model’s ability to work across multiple repositories simultaneously, which is closer to how real software engineering projects work. Leading on this benchmark with a model that costs a fraction of its competitors suggests MiniMax’s training approach is genuinely effective, not just benchmark-optimized.
The BFCL (Berkeley Function Calling Leaderboard) multi-turn score of 76.8 indicates strong tool-use capability, which matters for agentic workflows where models need to call APIs, execute code, and manage state across interactions.
Pricing Comparison
| Model | Input (per 1M tokens) | Output (per 1M tokens) | SWE-bench Verified |
|---|---|---|---|
| MiniMax M2.5 | $0.30 | $1.10 | 80.2% |
| Claude Opus 4.6 | $5.00 | $25.00 | 80.8% |
| GPT-5.3-Codex | ~$1.25* | ~$10.00* | 80.0% |
*GPT-5.3-Codex pricing estimated from predecessor; official pricing not yet announced.
The cost differential is stark. For input tokens, M2.5 is 16.7x cheaper than Opus 4.6. For output tokens, it’s 22.7x cheaper. Even against GPT-5.3-Codex’s estimated pricing, M2.5 is roughly 4-9x cheaper.
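To make the ratios concrete, here is a small sketch that turns the table prices into a monthly bill. The workload figures (50M input tokens, 5M output tokens per month) are hypothetical, chosen to resemble a busy automated-review pipeline.

```python
# Per-million-token prices (USD) from the pricing table above.
prices = {
    "MiniMax M2.5":    {"input": 0.30, "output": 1.10},
    "Claude Opus 4.6": {"input": 5.00, "output": 25.00},
}

def job_cost(model, input_tokens, output_tokens):
    """Cost in USD for a workload, given total token counts."""
    p = prices[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1e6

# Hypothetical monthly workload: 50M input tokens, 5M output tokens.
for model in prices:
    print(f"{model}: ${job_cost(model, 50e6, 5e6):,.2f}/month")
# → MiniMax M2.5: $20.50/month
# → Claude Opus 4.6: $375.00/month
```

At these assumed volumes the same workload costs about 18x less on M2.5, which is the gap the next paragraph is about.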
For organizations running high-volume coding workloads — CI/CD pipeline analysis, automated code review, large-scale refactoring — this pricing difference can mean the difference between an economically viable AI deployment and one that’s cost-prohibitive.
The Context Window Trade-Off
M2.5’s context window is 196.6K tokens. That’s adequate for most tasks but notably smaller than the 1M token contexts now offered by Claude Opus 4.6 and other frontier models. For workloads that require processing very large codebases or document sets in a single context, this limitation matters.
For standard software engineering tasks — where you’re working with a few files at a time and maintaining conversation context — 196K is more than sufficient.
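A quick way to gauge whether a given set of files fits in that window is the common rough heuristic of ~4 characters per token. The heuristic and the reserved-output figure below are approximations, not anything specified by MiniMax; real tokenizers vary by language and content.

```python
# Back-of-envelope check against M2.5's 196.6K-token context window,
# using the rough ~4 chars/token heuristic (approximate by nature).
CONTEXT_WINDOW = 196_600
CHARS_PER_TOKEN = 4

def fits_in_context(file_sizes_bytes, reserve_for_output=8_000):
    """Estimate whether the given files, plus room for the model's
    response, fit in one context window."""
    est_tokens = sum(file_sizes_bytes) // CHARS_PER_TOKEN
    return est_tokens + reserve_for_output <= CONTEXT_WINDOW

# A handful of source files (~300 KB total) fits comfortably...
print(fits_in_context([60_000, 120_000, 120_000]))  # → True
# ...but a 2 MB dump of a larger codebase does not.
print(fits_in_context([2_000_000]))                 # → False
```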
What This Means for the Market
M2.5 is the latest evidence that the price floor for frontier AI capabilities is dropping rapidly. A year ago, achieving 80%+ on SWE-bench Verified required the most expensive models from the largest labs. Today, a Chinese startup is matching those scores at a fraction of the cost. As open-weight models narrow the gap further, the landscape of viable options is expanding fast — our open source AI models guide tracks where things stand across coding, writing, and agentic use cases.
This has implications for the build-vs-buy decision facing every organization that uses AI. When frontier-level performance is available at commodity pricing, the calculus changes. The cost of running AI agents continuously — monitoring code quality, reviewing pull requests, generating tests — becomes feasible for teams that couldn’t justify it at $25 per million output tokens.
It also intensifies competitive pressure on the major labs. Anthropic and OpenAI can justify premium pricing when they offer clearly superior performance. When a model at one-tenth the price matches or leads on key benchmarks, the premium needs to be justified by other factors: reliability, safety, support, ecosystem integration, and the kind of nuanced capability differences that don’t show up in benchmark scores.
For enterprise buyers, the practical advice is to benchmark against your specific workloads. Two models separated by 0.6 points on SWE-bench can diverge substantially on your particular codebase and task distribution. At M2.5’s price point, the cost of running that evaluation is minimal.
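A workload-specific evaluation does not need heavy infrastructure. The sketch below shows the bare shape of one: `run_model` and the task list are placeholders to be wired to your actual API client and a sample of real tickets or pull requests; the checks here are toy stand-ins.

```python
# Minimal head-to-head evaluation harness (a sketch, not a framework).
# Each task pairs a prompt with a pass/fail check on the model's output.
def evaluate(run_model, tasks):
    """Return the fraction of tasks whose output passes its check."""
    passed = sum(1 for prompt, check in tasks if check(run_model(prompt)))
    return passed / len(tasks)

# Hypothetical toy tasks; replace with prompts drawn from your own
# codebase and checks that run tests or diff against known-good fixes.
tasks = [
    ("add 2+2", lambda out: "4" in out),
    ("name a prime greater than 10", lambda out: "11" in out or "13" in out),
]

# A stub "model" standing in for a real API call.
print(evaluate(lambda prompt: "4 and 11", tasks))  # → 1.0
```

Running both candidate models through the same task list gives a pass-rate comparison on the distribution you actually care about, which is more informative than a 0.6-point benchmark gap.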
Key Details
| Spec | Detail |
|---|---|
| Architecture | Mixture of Experts (MoE) |
| Total Parameters | 230 billion |
| Active Parameters | 10 billion |
| Context Window | 196.6K tokens |
| Input Pricing | $0.30 / 1M tokens |
| Output Pricing | $1.10 / 1M tokens |
| Variants | M2.5 (standard), M2.5 Lightning (speed-optimized) |
| Official Announcement | MiniMax blog |
| Independent Evaluation | OpenHands analysis |
