Frontier AI labs are building trillion-parameter models with Mixture of Experts (MoE) architectures. That story is well-covered — DeepSeek V3, Kimi K2.5, and Qwen 3.5 all use MoE to pack massive knowledge into models that activate only a fraction of their parameters per token.
But there is a less-discussed story that matters more for most practitioners: MoE is not just a scaling trick for frontier labs. It is an efficiency architecture that fundamentally changes what is possible at small scale. A 47-billion-parameter MoE model that matches a 70B dense model. A 16B MoE model that rivals a 7B dense model at 40% of the compute. A 6.6B-active-parameter model that outperforms every dense model under 14B.
These are not theoretical results. They are shipping models — running on consumer GPUs, powering production systems, and reshaping what “small model” means. This guide covers the architectural patterns, the efficiency math, and the practical deployment tradeoffs of small and mid-size MoE models.
The Core Efficiency Argument
The foundational MoE article explains the mechanics of gating, expert routing, and load balancing in detail. Here, we focus on the practical consequence: what does sparse routing actually buy you at inference time?
Active vs. Total Parameters
Every MoE model has two parameter counts that matter:
- Total parameters: all weights across all experts. Determines memory footprint and knowledge capacity.
- Active parameters: the subset of weights used for any given token. Determines compute cost (FLOPs) and effective inference speed.
The ratio between these two numbers is the source of MoE’s efficiency. A model with 47B total parameters but 13B active parameters stores knowledge like a 47B model but computes each token like a 13B model.
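This decoupling can be made concrete with a little arithmetic. The sketch below assumes FP16 weights and the standard rough estimate of ~2 FLOPs per active parameter per token for a forward pass; both are approximations, not exact figures for any specific runtime:

```python
def moe_footprint(total_params_b: float, active_params_b: float,
                  bytes_per_param: int = 2) -> dict:
    """Memory scales with total parameters; per-token compute with active ones."""
    return {
        # billions of params x bytes per param = gigabytes of weight memory
        "memory_gb": total_params_b * bytes_per_param,
        # ~2 FLOPs per active parameter per token (forward pass estimate)
        "gflops_per_token": 2 * active_params_b,
    }

mixtral = moe_footprint(46.7, 12.9)
# Stores knowledge like a 47B model (~93 GB at FP16),
# but computes each token like a 13B model (~26 GFLOPs).
```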
This distinction does not exist in dense models. When Llama 2 70B processes a token, all 70 billion parameters participate — including the billions encoding knowledge about topics completely irrelevant to the current input. The compute is spent regardless.
The Efficiency Math
The feed-forward network (FFN) dominates per-token compute in a transformer — roughly two-thirds of the FLOPs per layer. MoE replaces the single dense FFN with N expert FFNs and activates only k of them per token. The compute savings on the FFN portion are:
FFN_FLOPs_MoE = (k / N) × FFN_FLOPs_dense
For Mixtral 8x7B with N = 8 and k = 2, this is 25% of the dense FFN cost. For OLMoE with N = 64 and k = 8, it is 12.5%. The attention layers, embeddings, and router are still dense — so total per-token savings are less dramatic than the FFN ratio alone suggests — but the net effect is still substantial.
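The ratio is simple enough to check directly; a minimal helper:

```python
def ffn_flops_fraction(num_experts: int, top_k: int) -> float:
    """FFN compute of a top-k MoE layer as a fraction of its dense counterpart."""
    return top_k / num_experts

print(ffn_flops_fraction(8, 2))    # Mixtral 8x7B: 0.25
print(ffn_flops_fraction(64, 8))   # OLMoE: 0.125
```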
The key insight: MoE models achieve quality closer to their total parameter count but inference speed closer to their active parameter count. This is why a 47B MoE can match a 70B dense model while running significantly faster.
What the Benchmarks Show
The efficiency gains are consistent across every model family that has released both dense and MoE variants:
| MoE Model | Active Params | Dense Equivalent | Dense Params | Efficiency Ratio |
|---|---|---|---|---|
| Mixtral 8x7B | 12.9B | Llama 2 70B | 70B | 5.4x |
| DeepSeek-MoE 16B | 2.8B | DeepSeek 7B | 7B | 2.5x |
| Qwen1.5-MoE-A2.7B | 2.7B | Qwen1.5-7B | 7B | 2.6x |
| Phi-3.5-MoE | 6.6B | ~Dense 14B class | 14B | 2.1x |
“Efficiency ratio” here means the dense model’s parameter count divided by the MoE model’s active parameter count: how many times more active parameters the matched dense model needs. Mixtral 8x7B is the standout, matching a model with 5.4 times more active parameters, but every entry demonstrates that MoE consistently delivers more capability per FLOP.
Research from ICLR 2024 (Shen et al.) formalized this finding: the cost-performance Pareto frontier for MoE dominates dense models “by a wide margin” across all model scales, from small to XXL. MoE is not just better at frontier scale — it is better at every scale.
The Small MoE Landscape
The wave of small and mid-size MoE models began in late 2023 and accelerated through 2024. Each model made distinct architectural choices that reveal how the design space is being explored.
Mixtral 8x7B: The Model That Started It All
Released by Mistral AI in December 2023, Mixtral 8x7B (Jiang et al., 2024) was the first open-weight MoE model to achieve genuine frontier-competitive performance. It demonstrated that sparse routing was not just a research curiosity — it was a production-ready architecture.
Architecture: 46.7B total parameters, 12.9B active. 8 experts per layer with top-2 routing. 32K token context window. Each expert is a full-size FFN block inherited from the Mistral 7B architecture, so the model is structurally “8 copies of Mistral 7B’s FFN” with shared attention layers.
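Mixtral's router produces a logit per expert and applies a softmax over only the top-2 logits, as described in the paper. A pure-Python sketch of that gating step (illustrative only, not the production kernel):

```python
import math

def top2_gate(logits):
    """Pick the two highest-scoring experts; softmax over just their logits."""
    top2 = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)[:2]
    # Numerically stable softmax restricted to the two selected logits
    exps = [math.exp(logits[i] - logits[top2[0]]) for i in top2]
    total = sum(exps)
    return top2, [e / total for e in exps]

# 8 hypothetical router logits, one per expert
experts, weights = top2_gate([0.2, 3.1, -0.5, 1.9, 0.0, 0.7, 2.2, -1.0])
# experts == [1, 6]; the token's FFN output is the weighted sum
# of those two experts' outputs.
```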
The headline result: Mixtral matched or exceeded Llama 2 70B on most benchmarks — MMLU 70.6%, HellaSwag 86.7%, ARC Challenge 85.8% — while activating only 12.9B parameters per token versus Llama 2’s 70B. On code and math tasks, Mixtral was “vastly superior” to Llama 2 70B. Inference throughput was roughly 6x faster.
Why it mattered: Before Mixtral, the open-source community’s best models were dense — Llama 2 70B, Mistral 7B, and their fine-tunes. Mixtral proved that you could get 70B-class quality at a fraction of the inference cost with an open-weight model. It catalyzed an entire generation of MoE development.
Mixtral 8x22B followed in April 2024 — same 8-expert, top-2 architecture but with 22B-parameter experts, totaling 141B parameters with 39B active. MMLU rose to 77.3%, positioning it between Llama 2 70B and the then-upcoming Llama 3 70B.
DeepSeek-MoE 16B: Fine-Grained Expert Design
Released in January 2024 (Dai et al., 2024), DeepSeek-MoE 16B introduced two architectural innovations that would later influence DeepSeek V2 and V3.
Architecture: 16.4B total parameters, 2.8B active. 2 shared experts (always active) plus 64 routed experts per MoE layer. Top-6 routing from the routed pool, so 8 total experts (6 routed + 2 shared) process each token. Each expert is 0.25x the size of a standard FFN.
Innovation 1 — Fine-grained expert segmentation: Instead of Mixtral’s approach of 8 large experts, DeepSeek used 64 small experts. This is not just a scaling difference — it fundamentally changes the combinatorial space of routing. With 8 experts and top-2 routing, there are C(8, 2) = 28 possible expert combinations per token. With 64 experts and top-6 routing, there are C(64, 6) ≈ 75 million possible combinations. The model can express dramatically more nuanced routing patterns.
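These counts come straight from the binomial coefficient, which Python's `math.comb` computes exactly:

```python
from math import comb

print(comb(8, 2))    # Mixtral-style routing: 28 ways to pick 2 of 8 experts
print(comb(64, 6))   # DeepSeek's routed pool: 74,974,368 combinations
```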
Innovation 2 — Shared expert isolation: The 2 always-active shared experts capture universal patterns — basic syntax, common semantic operations — that apply across all inputs. This frees the 64 routed experts from needing to redundantly encode common knowledge, letting them specialize more aggressively. Without shared experts, every routed expert must carry some general-purpose capability as a fallback, diluting specialization.
The result: DeepSeek-MoE 16B matched DeepSeek 7B (dense) with only 40.5% of the computation. It outperformed Llama 2 7B on most benchmarks with 39.6% of the computation. These efficiency ratios validated fine-grained expert design as a viable alternative to Mixtral’s coarser approach.
Phi-3.5-MoE: Microsoft’s Efficiency Play
Released in August 2024, Phi-3.5-MoE demonstrated that MoE could push small-model performance into territory previously reserved for much larger dense models.
Architecture: 42B total parameters, 6.6B active. 16 experts per layer with top-2 routing. 128K token context window. Trained on 4.9 trillion tokens over 23 days on 512 H100 GPUs.
The headline result: With only 6.6B active parameters, Phi-3.5-MoE achieved 78.9% on MMLU — outperforming Llama 3.1 8B, Mistral-Nemo-12B, and Gemma-2-9B, and performing comparably to Gemini 1.5 Flash and GPT-4o-mini on several tasks. It supported 20+ languages despite its compact active size.
GRIN MoE: Microsoft developed a novel training method called GRadient-INformed MoE (Mirzadeh et al., 2024) that uses gradient information to improve expert specialization and parameter utilization. A GRIN-trained variant scored 79.4 MMLU, 74.4 HumanEval, and 58.9 MATH.
Why it matters: Phi-3.5-MoE showed that the “small model” ceiling could be raised substantially by switching from dense to sparse architecture. A model with 6.6B active parameters, comparable in per-token compute to a 7B dense model, was competing with models at 2x its active size.
DBRX: Dropless Routing
Released by Databricks in March 2024, DBRX explored a different dimension of MoE design — how to eliminate token dropping entirely.
Architecture: 132B total parameters, 36B active. 16 experts per layer with top-4 routing. Trained on 12 trillion tokens.
Innovation — Dropless MoE via MegaBlocks: Standard MoE implementations set a fixed capacity factor for each expert — a buffer size limiting how many tokens an expert can process per batch. When an expert is popular and its buffer fills up, excess tokens are “dropped” (passed through unchanged without expert processing). This wastes routing decisions and degrades quality.
DBRX used MegaBlocks, a framework that reformulates MoE computation as block-sparse matrix operations. Instead of fixed-capacity expert buffers, each expert dynamically processes however many tokens are routed to it. No tokens are ever dropped. This eliminates a fundamental source of quality loss in MoE training and inference.
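The difference between capacity-limited and dropless routing can be sketched in a few lines. This is a toy illustration of the token-dropping mechanism, not MegaBlocks' block-sparse kernels:

```python
from collections import defaultdict

def route_fixed_capacity(assignments, capacity):
    """Each expert has a fixed buffer; overflow tokens are dropped."""
    buckets, dropped = defaultdict(list), []
    for token, expert in assignments:
        if len(buckets[expert]) < capacity:
            buckets[expert].append(token)
        else:
            dropped.append(token)  # passes through the layer unprocessed
    return buckets, dropped

def route_dropless(assignments):
    """Every expert processes however many tokens are routed to it."""
    buckets = defaultdict(list)
    for token, expert in assignments:
        buckets[expert].append(token)
    return buckets

# A skewed batch: 5 tokens all routed to expert 0
skewed = [(t, 0) for t in range(5)]
_, dropped = route_fixed_capacity(skewed, capacity=2)  # 3 tokens dropped
assert len(route_dropless(skewed)[0]) == 5             # none dropped
```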
Combinatorial advantage: With 16 experts and top-4 routing, DBRX has C(16, 4) = 1,820 possible expert combinations per token — compared to Mixtral’s 28. This 65x increase in routing expressiveness, combined with dropless routing, contributed to DBRX achieving 73.7% MMLU and surpassing GPT-3.5 on most benchmarks at launch.
OLMoE: Fully Open MoE Research
Released by the Allen Institute for AI (AI2) in September 2024 (Muennighoff et al., 2024), OLMoE is notable both for its architecture and its openness — all code, data, weights, training logs, and intermediate checkpoints were released.
Architecture: 6.9B total parameters, 1.3B active. 64 experts per layer with top-8 routing. Trained on 5.1 trillion tokens over 10 days on 256 H100 GPUs.
The significance of 1.3B active: OLMoE proved that MoE efficiency holds even at very small active parameter counts. With only 1.3B active parameters, it achieved state-of-the-art performance among models with fewer than 2B active parameters, outperformed Llama 2 7B on several benchmarks, and was competitive with Llama 2 13B — despite computing each token with roughly 1/10th the FLOPs.
Training efficiency: OLMoE trained 2x faster than equivalent dense models at the same total parameter count, demonstrating that sparse computation accelerates not just inference but training.
JetMoE: MoE in Attention Too
Released in April 2024 (Shen et al., 2024), JetMoE-8B explored an architectural direction that most MoE models leave untouched: applying sparse routing to the attention layers, not just the FFN.
Architecture: 8B total parameters, 2.2B active. 24 transformer blocks, each containing two MoE layers — a Mixture of Attention heads (MoA) with 8 attention experts (top-2) and a Mixture of MLP Experts with 8 FFN experts (top-2).
The attention MoE insight: In standard MoE, only the FFN layers are sparse — the attention layers remain dense and execute fully for every token. Since attention accounts for roughly one-third of per-layer compute, this limits the maximum possible savings from sparsification. JetMoE’s Mixture of Attention applies the same routing principle to attention heads, potentially sparsifying the remaining compute.
Cost efficiency: JetMoE-8B was trained for less than $100,000 using 30,000 H100 GPU hours. Despite this modest budget, it outperformed Llama 2 7B and Llama 2 13B, reducing inference computation by ~70% compared to Llama 2 7B.
Qwen MoE: The Upcycling Path
Alibaba’s Qwen team explored MoE across multiple model sizes, but Qwen1.5-MoE-A2.7B (March 2024) is particularly interesting because of how it was created.
Architecture: 14.3B total parameters, 2.7B active. 4 shared experts (always active) plus 60 routed experts per layer with top-4 routing from the routed set. Each expert is a fine-grained partition of the FFN, not a full-size copy.
Upcycling from dense: Rather than training from scratch, Qwen1.5-MoE was upcycled from a pre-trained dense Qwen-1.8B model. The single dense FFN was partitioned into multiple expert segments, shared experts were designated, and the router was initialized and trained. This approach reduced training costs by 75% compared to training an equivalently-performing Qwen1.5-7B from scratch.
The result: Qwen1.5-MoE-A2.7B matched Qwen1.5-7B (MMLU 62.5% vs 61.0%) and was competitive with Mistral 7B (MMLU 64.1%) while activating only 2.7B parameters. Inference throughput was 1.74x faster than the dense equivalent on an A100-80G GPU (4,010 vs 2,299 tokens/sec).
The upcycling lesson: You do not necessarily need to train an MoE model from scratch. A well-trained dense model can be converted into an MoE model by partitioning its FFN into experts and training a router on top. This dramatically lowers the barrier to experimenting with MoE architectures.
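The partitioning step can be sketched with plain matrix slicing. Shapes and the contiguous-slice scheme here are illustrative assumptions; the actual Qwen recipe also designates shared experts and continues training:

```python
def split_ffn(w_in, w_out, num_experts):
    """Slice a dense FFN's hidden units into equal contiguous expert shards.

    w_in is d_model x d_ff (columns are hidden units); w_out is
    d_ff x d_model. Each expert inherits 1/num_experts of the hidden
    units; a router is then initialized and trained on top.
    """
    d_ff = len(w_out)
    step = d_ff // num_experts
    experts = []
    for e in range(num_experts):
        cols = range(e * step, (e + 1) * step)
        expert_in = [[row[c] for c in cols] for row in w_in]  # d_model x step
        expert_out = w_out[e * step:(e + 1) * step]           # step x d_model
        experts.append((expert_in, expert_out))
    return experts
```

Each expert starts from trained weights rather than random initialization, which is why upcycling converges so much more cheaply than training from scratch.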
Architectural Trends
Looking across these models, several clear design trends emerge.
More Experts, Smaller Each
The evolution is striking:
| Generation | Model | Experts | Expert Size |
|---|---|---|---|
| Early (2023) | Mixtral 8x7B | 8 | Full FFN (~7B each) |
| Mid (2024) | DBRX | 16 | Half FFN |
| Fine-grained (2024) | DeepSeek-MoE 16B | 64 | Quarter FFN |
| Ultra-fine (2024) | OLMoE | 64 | ~100M each |
| Frontier (2025-26) | Qwen 3.5 | 512 | Tiny FFN |
The trend toward many small experts has been validated at every scale. More experts means more possible routing combinations, which means more nuanced specialization. The combinatorial richness of selecting 8 experts from a pool of 64 vastly exceeds selecting 2 from a pool of 8 — even if the total parameter budget is identical.
This is a counter-intuitive result. You might expect that larger, more powerful individual experts would outperform smaller, weaker ones. But the evidence consistently shows that the combination of multiple small specialists outperforms fewer large generalists, because the routing network can compose different expert capabilities for different inputs.
Shared Experts Are Standard
DeepSeek-MoE 16B, Qwen1.5-MoE, and all subsequent DeepSeek and Qwen models use shared experts — sub-networks that process every token regardless of routing decisions. The rationale is simple: some operations are universal. Basic syntax processing, common semantic transformations, and foundational linguistic operations apply to every input. Without shared experts, every routed expert must redundantly encode these universal capabilities, wasting capacity that could go toward specialization.
The typical pattern is 1-4 shared experts plus a much larger pool of routed experts. The shared experts act as a “baseline” processor, and the routed experts provide specialized augmentation on top.
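Functionally, a shared-plus-routed layer looks like the sketch below. The expert and router callables are hypothetical stand-ins for real FFN blocks and a trained gating network:

```python
def moe_layer(x, shared, routed, router):
    """Shared experts run on every token; routed experts are gated per token."""
    out = sum(e(x) for e in shared)   # always-active baseline processing
    idx, weights = router(x)          # top-k routed expert ids and weights
    for i, w in zip(idx, weights):
        out += w * routed[i](x)       # specialized augmentation on top
    return out

# Toy 1-D example: one shared expert, two routed experts, a fixed router
y = moe_layer(
    1.0,
    shared=[lambda x: x],
    routed=[lambda x: 2 * x, lambda x: 3 * x],
    router=lambda x: ([0, 1], [0.5, 0.5]),
)
# y == 1.0 + 0.5*2 + 0.5*3 == 3.5
```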
MoE Beyond FFN Layers
JetMoE’s application of sparse routing to attention heads opens a design dimension that most models have not yet explored. If MoE works for FFN layers — which account for ~2/3 of per-layer compute — it should logically also work for attention layers, which account for the remaining ~1/3.
The challenge is that attention heads have more complex interdependencies than FFN blocks. In multi-head attention, different heads learn to attend to different aspects of the input (positional patterns, syntactic relationships, semantic similarities). Routing tokens to different subsets of attention heads risks disrupting these learned patterns. JetMoE’s results suggest the approach is viable, but it has not yet been adopted by larger models.
Upcycling vs. Training from Scratch
Qwen’s upcycling approach — converting a trained dense model into an MoE model — represents a potentially important efficiency path. Training a competitive dense model and then converting it to MoE requires substantially less total compute than training an MoE model from scratch, because the expert weights are initialized with useful representations rather than random values.
The tradeoff is that upcycled models may not achieve the same level of expert specialization as models trained from scratch, because the router has less opportunity to co-evolve with the expert weights. But for practitioners with limited compute budgets, upcycling offers a practical path to MoE benefits.
Deployment on Consumer Hardware
The promise of small MoE models is that they deliver outsized quality for their compute cost. But deploying them requires navigating a specific set of tradeoffs that differ from dense models.
The Memory Problem
Here is the fundamental tension: an MoE model’s compute cost scales with active parameters, but its memory cost scales with total parameters. All experts must reside in memory because routing decisions are dynamic — you cannot predict which experts a future token will need.
Mixtral 8x7B has 46.7B total parameters. At FP16 (2 bytes per parameter), that is ~93 GB — more than a single RTX 4090’s 24 GB VRAM. At 4-bit quantization (Q4_K), it drops to ~26 GB, which fits on a single 32GB GPU or a pair of 16GB GPUs.
Compare this to Llama 2 13B, a dense model with roughly similar MMLU scores: 13B parameters at FP16 is ~26 GB, or ~7 GB at 4-bit quantization. The dense model fits comfortably on a single consumer GPU with room to spare.
So while Mixtral is faster per token than Llama 2 70B (its quality-equivalent dense model), it requires more memory than Llama 2 13B (a dense model it merely matches on some benchmarks). The comparison that matters depends on what you are optimizing for:
- Optimizing for quality at fixed compute budget: MoE wins. Mixtral gives you 70B-class quality at 13B-class speed.
- Optimizing for quality at fixed memory budget: Dense might win. If you can only afford 8 GB of VRAM, a well-tuned dense 7B model may outperform whatever MoE model you can squeeze into that space.
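The memory arithmetic behind these comparisons reduces to one helper. The ~4.5 effective bits per weight for Q4_K is an approximation of llama.cpp's K-quant formats; real GGUF files carry additional scale metadata:

```python
def weights_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate weight memory in GB: billions of params x bits / 8."""
    return params_billion * bits_per_param / 8

print(weights_gb(46.7, 16))   # Mixtral FP16   ~93 GB
print(weights_gb(46.7, 4.5))  # Mixtral Q4_K   ~26 GB
print(weights_gb(13.0, 4.5))  # Llama 2 13B Q4  ~7 GB
```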
Quantization Is More Important for MoE
Because MoE models are memory-bottlenecked rather than compute-bottlenecked, quantization has an outsized impact. Reducing precision from FP16 to INT4 cuts memory by 4x — and since memory is the binding constraint, this directly translates to accessibility.
Mixtral 8x7B quantization tiers:
| Quantization | Memory | Quality Impact | Hardware |
|---|---|---|---|
| FP16 | ~93 GB | Baseline | Multi-GPU or high-end Apple Silicon |
| INT8 | ~47 GB | Minimal | Dual RTX 4090 or M2 Ultra |
| Q5_K | ~32 GB | Very small | Single GPU + partial offload |
| Q4_K | ~26 GB | Small | Dual RTX 3090 or M3 Max |
| Q3_K | ~20 GB | Moderate | Single RTX 4090 |
MoE models appear to be reasonably robust to quantization — potentially more so than dense models. The hypothesis is that each expert operates on a narrower distribution of inputs than a dense model’s FFN, producing weight distributions that are more quantization-friendly. The router weights are typically kept at full precision since routing errors (sending a token to the wrong expert) are more damaging than small precision losses within an expert.
Inference Speed in Practice
Real-world benchmarks for Mixtral 8x7B (Q4_K quantized via llama.cpp):
| Hardware | Tokens/sec | Notes |
|---|---|---|
| Dual RTX 4090 (48 GB) | ~59 | Fastest consumer option |
| Dual RTX 3090 (48 GB) | ~54 | Strong value option |
| Apple M2 Ultra (192 GB) | ~44 | Unified memory advantage |
| Apple M3 Max (128 GB) | ~22 | Viable for development |
| Single GPU + CPU offload | ~7 | Functional but slow |
For comparison, Llama 2 70B (the quality-equivalent dense model) at Q4_K on the same hardware runs at roughly 8-15 tokens/sec on dual consumer GPUs. Mixtral is 4-7x faster because it activates far fewer parameters per token (12.9B vs 70B), while also occupying less total memory.
This is where the MoE value proposition is clearest: if you have enough memory to hold the model, you get better quality per token of latency than any dense model.
Expert Offloading
For hardware that cannot hold all experts in GPU memory simultaneously, expert offloading moves inactive experts to CPU RAM and loads them into GPU memory on demand. This trades latency for accessibility:
- GPU-only: All experts in VRAM. No latency penalty. Requires enough VRAM for full model.
- GPU + CPU offload: Active experts in VRAM, inactive experts in CPU RAM. Adds microseconds to milliseconds per expert swap. Viable for interactive use on lower-VRAM GPUs.
- GPU + disk offload: Most experts on NVMe SSD. Adds milliseconds per swap. Only viable for batch processing.
Inference frameworks such as vLLM and llama.cpp support various offloading strategies, and research systems go further with predictive offloading: analyzing routing patterns to pre-fetch experts that are likely to be needed, hiding much of the latency penalty.
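A minimal LRU cache captures the core of the GPU + CPU offload strategy: keep recently used experts in fast memory and evict the coldest on overflow. Real systems add prediction and asynchronous prefetch; the names below are illustrative:

```python
from collections import OrderedDict

class ExpertCache:
    """Toy LRU cache keeping the most recently used experts in fast memory."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.resident = OrderedDict()  # expert_id -> weights
        self.loads = 0                 # count of slow-path loads

    def fetch(self, expert_id, load_fn):
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)  # hit: mark as hot
            return self.resident[expert_id]
        self.loads += 1
        weights = load_fn(expert_id)              # miss: pull from CPU/disk
        self.resident[expert_id] = weights
        if len(self.resident) > self.capacity:
            self.resident.popitem(last=False)     # evict coldest expert
        return weights

cache = ExpertCache(capacity=2)
for eid in [0, 1, 0, 2, 0]:  # expert 0 stays hot; expert 1 gets evicted
    cache.fetch(eid, load_fn=lambda i: f"weights[{i}]")
# cache.loads == 3: experts 0, 1, 2 each loaded once; repeats of 0 were hits
```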
When to Choose MoE vs. Dense
The decision between MoE and dense architectures depends on specific deployment constraints and priorities.
Choose MoE When
- You need maximum quality per FLOP. If your bottleneck is inference compute cost (cloud GPU bills, latency requirements, throughput targets), MoE delivers more capability per unit of compute than any dense architecture.
- You have sufficient memory headroom. If your deployment hardware has enough RAM/VRAM to hold the full MoE model (possibly quantized), you get the quality benefits without the compute costs.
- You are serving at scale. At high request volumes, MoE’s lower per-token compute cost compounds into significant cost savings. The higher base memory cost is amortized across many requests.
- You need a specific knowledge breadth. MoE’s larger total parameter count stores more factual knowledge than a dense model with the same active parameters. If your use case requires broad knowledge (question answering, research assistance, multilingual support), MoE has an advantage.
Choose Dense When
- Memory is the binding constraint. If you are deploying on a single consumer GPU with limited VRAM, a well-optimized dense model may be the only option. A dense 7B model at Q4 needs ~4 GB; the smallest competitive MoE model needs 4-5x more.
- Deployment simplicity matters. Dense models are simpler to serve — no routing overhead, no expert parallelism needed, more predictable memory access patterns. For edge deployment or embedded systems, this simplicity has real value.
- Your use case is narrow. If your application focuses on a single domain (code generation, medical QA, translation), a fine-tuned dense model may outperform a general-purpose MoE model because all of its parameters are relevant to the task at hand.
The Hybrid Future
The clean MoE-vs-dense distinction is already blurring. Qwen 3.5 uses hybrid attention (combining standard softmax attention with linear attention). Snowflake Arctic uses a dense backbone with MoE augmentation. JetMoE applies sparsity to both attention and FFN layers.
The trajectory points toward models that are sparse where sparsity helps (FFN layers, where different inputs benefit from different processing) and dense where density helps (attention layers, where global context integration requires full participation). The question is not “MoE or dense” but “which layers should be sparse, and how sparse?”
What This Means for Practitioners
The practical implications of small MoE models are straightforward:
The quality floor has risen. A year ago, if you needed 70B-class performance, you needed 70B-class hardware. Today, Mixtral 8x7B delivers comparable quality on hardware that costs a fraction as much. The gap between “what you can run locally” and “what frontier APIs offer” has narrowed significantly. For a practical overview of which open-weight MoE and dense models are worth running today — across coding, writing, agents, and multimodal tasks — see our open source AI models guide.
Memory is the new bottleneck. For dense models, compute and memory scale together — a bigger model is both slower and larger. For MoE models, they decouple. The limiting factor for local MoE deployment is almost always memory, not compute. This shifts hardware purchasing decisions toward memory-rich configurations (Apple Silicon with unified memory, multi-GPU setups, high-RAM servers) rather than pure compute power.
Upcycling lowers the barrier. Qwen’s demonstration that dense models can be converted to MoE post-training means that any organization with a trained dense model can experiment with MoE. You do not need to retrain from scratch — you can partition an existing model’s FFN into experts, train a router, and evaluate the results at modest cost.
The architecture is proven. Small MoE models have been validated by Mistral, DeepSeek, Microsoft, Databricks, AI2, Alibaba, and numerous independent researchers. The training techniques (load balancing, shared experts, fine-grained routing) are well-understood. The inference frameworks (vLLM, llama.cpp, TensorRT-LLM) support MoE natively. This is not experimental technology — it is a mature, production-ready architecture that happens to be newer than the dense approach it is gradually replacing.
The fundamental insight holds at every scale: not every parameter needs to participate in every computation. For practitioners building AI systems or evaluating model architectures for production use, MoE is no longer a frontier-only technique. It is the most efficient path to capable AI at any scale.
