MoE architectures now dominate the frontier
Qwen 3.5, Llama 4, DeepSeek-V3.2, Mistral Large 3, and gpt-oss all use mixture-of-experts, delivering frontier quality while activating only a fraction of their parameters per token. Teams should plan for MoE-friendly serving.
A practical, no-nonsense guide for founders, engineers, and AI teams deciding which open source or open-weight models are actually worth testing — by workload, benchmark profile, license fit, and hardware reality.
If you only need the short version, it is this: most teams evaluating open models in 2026 should begin with a strong 7B-32B text model, an explicit evaluation harness, and a clear hardware budget before they touch giant MoE systems, open video stacks, or "frontier" model marketing.
DeepSeek-V3.2 matches GPT-5 on reasoning benchmarks. Qwen 3.5 397B competes with Claude 4.5 Opus on multimodal tasks. OpenAI's gpt-oss models run on consumer hardware. The gap between open and closed has effectively closed for many workloads.
A model can publish weights without publishing the full training data or reproducible training recipe. That matters for auditability, governance, and legal clarity.
Qwen 3.5, Llama 4, and Gemma 3 all handle text and images natively. Separate vision adapters are no longer the primary path for multimodal work.
The right model is not the one with the biggest benchmark headline. It is the one that clears your real task evaluations, fits your hardware envelope, survives structured-output tests, and carries a license your company can live with.
This matrix is designed to help teams choose a sensible starting point instead of trying everything at once.
| Use case | Start here | Why | Hardware reality | Watch-out |
|---|---|---|---|---|
| General writing, chat, summaries, RAG | Qwen 3.5 9B/27B or Llama 4 Scout | Strong general capability with efficient MoE inference and native multimodal support. | 12–24GB VRAM for smaller Qwen 3.5 variants; Scout fits on a single H100 with on-the-fly int4 quantization. | Check license and language fit before standardizing. |
| Coding assistant for real development work | Qwen3-Coder-30B-A3B or Qwen3-Coder-Next | Purpose-built for agentic coding with strong tool-use and environment interaction. | Qwen3-Coder-Next runs comfortably on 24GB VRAM thanks to hybrid MoE. | You still need tests, linting, and security review. |
| Tool-using agents and workflow automation | Qwen 3.5 Instruct or Llama 4 Maverick | Native tool-calling support, strong structured output, and broad ecosystem adoption. | 9B–35B is the practical starting band for Qwen 3.5. | Agent quality depends as much on orchestration and evals as on the base model. |
| Top-end open reasoning and large-scale inference | DeepSeek-V3.2 | GPT-5-level reasoning with integrated thinking and tool-use. V3.2-Speciale variant exceeds GPT-5 on math and reasoning but does not support tool calling. | Datacenter-class serving is the realistic target (685B total params, 37B active). | Too large for most local teams, but API access is widely available. |
| Vision-language understanding | Qwen 3.5 (natively multimodal) or Gemma 3 12B/27B | Native multimodal architectures handle text, images, and video without separate adapters. | Single prosumer GPU for Gemma 3 27B; Qwen 3.5 small models fit consumer hardware. | Use dedicated evaluation for grounding, not just chat quality. |
| New image generation | SDXL or FLUX.1 schnell | Best mix of maturity, workflow support, and local deployment options. | SDXL is happiest around 12GB VRAM; more if using refiners heavily. | Commercial terms vary by checkpoint. |
| Image editing, control, inpainting | Diffusers + ControlNet + SAM 2 + LaMa | Editing is a stack problem, not a single-model problem. | Consumer GPUs work well for many workflows. | Workflow quality depends on masks, conditioning, and operator skill. |
| Speech and audio | Whisper large-v3 | Best open speech-to-text model with broad language coverage and strong accuracy. | Runs comfortably on consumer GPUs; large-v3 needs ~10GB VRAM. | Whisper covers transcription only; for audio generation, note that AudioCraft weights are CC-BY-NC 4.0, which limits commercial use. |
| Open video experiments | Wan2.1 or FramePack-style local workflows | Practical on better consumer hardware and improving rapidly. | Expect 16–24GB VRAM to be comfortable; smaller variants can go lower. | Video quality, latency, and consistency remain uneven. |
Choose the smallest model that reliably completes your real task with the right output shape. Then move up only if the gains are measurable.
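The "smallest model that clears the bar" rule can be wired directly into an evaluation harness. A minimal sketch, where the task cases and the `generate` callables are hypothetical stand-ins for your real workload and model calls:

```python
import json
from typing import Callable

# Hypothetical task cases: a prompt plus a checker for the required output shape.
CASES = [
    ('Extract the total as JSON like {"total": 42.5}: "Invoice total: $42.50"',
     lambda out: json.loads(out).get("total") == 42.5),
    ('Detect the language as JSON like {"lang": "fr"}: "Bonjour"',
     lambda out: json.loads(out).get("lang") == "fr"),
]

def pass_rate(generate: Callable[[str], str]) -> float:
    """Fraction of cases whose output parses and matches the expected shape."""
    passed = 0
    for prompt, check in CASES:
        try:
            if check(generate(prompt)):
                passed += 1
        except (json.JSONDecodeError, AttributeError, TypeError):
            pass  # malformed output counts as a failure, not a crash
    return passed / len(CASES)

def pick_smallest(candidates, threshold=0.9):
    """Candidates are ordered smallest model first; return the first that clears the bar."""
    for name, generate in candidates:
        if pass_rate(generate) >= threshold:
            return name
    return None
```

Candidates go in smallest first; the first model that clears the threshold on your real cases wins, and anything larger has to justify itself with a measurable gain.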
This page separates fully open releases from open-weight releases and license-restricted releases because the market still collapses those categories into the same marketing label.
Fully open: weights, code, and meaningful training information are available. These releases are the closest match to the OSI-style vision of Open Source AI.
Open-weight: you can download and run the weights, but the full training data and recipe are not completely reproducible. This is where most high-performing "open" models sit today.
License-restricted: some code and weights are available, but revenue thresholds, non-commercial clauses, or behavioral restrictions matter in production.
Open AI is no longer one category. The ecosystem now includes general-purpose LLMs, code-specialized models, multimodal models, image generators, image-editing stacks, and increasingly capable video systems.
| Category | Best for | Start here | Move up to | Sweet spot | Watch-outs |
|---|---|---|---|---|---|
| Writing / general LLM | Chat, drafting, summarization, RAG, internal copilots | Qwen 3.5 9B or Llama 4 Scout | Qwen 3.5 397B-A17B, Llama 4 Maverick, Mistral Large 3, DeepSeek-V3.2 | Qwen 3.5 27B–35B-A3B or gpt-oss-20b covers most team needs. | License terms and quantization quality matter more than leaderboard hype. |
| Coding | PR assistance, code generation, refactors, test writing | Qwen3-Coder-Next or Qwen3-Coder-30B-A3B | Qwen3-Coder-480B-A35B, Mistral Large 3, or DeepSeek-V3.2 | The 30B-A3B MoE variant is the most practical for real developer use. | Do not deploy without tests, sandboxing, and dependency/security review. |
| Agents | Tool use, workflow automation, multi-step task execution | Qwen 3.5 27B or Llama 4 Scout | Llama 4 Maverick, Mistral Large 3, or DeepSeek-V3.2 | Smaller models plus strong orchestration often beat oversized models with weak tooling. | JSON breakage, tool misuse, and cascading failures are the real bottlenecks. |
| Multimodal / VLM | Document understanding, image Q&A, visual agents, OCR-heavy workflows | Gemma 3 12B/27B or Qwen 3.5 9B | Qwen 3.5 397B-A17B or Llama 4 Maverick | Gemma 3 27B is the most practical local starting point with 128K context. | Grounding mistakes and OCR hallucinations still require checks. |
| Image generation | Concept art, marketing assets, ideation, product visuals | SDXL | FLUX.1 dev or SD3.5 when licensing permits | SDXL remains the safest default for broad ecosystem compatibility. | Typography and exact prompt fidelity still need workflow iteration. |
| Image editing | Inpainting, control, masking, pose/depth guidance, product edits | ControlNet + SAM 2 + LaMa + Diffusers | Project-specific editing stacks with custom masks and pipelines | Editing quality comes from stack design, not one magic checkpoint. | Commercial rights differ across base checkpoints and extensions. |
| Speech / audio | Transcription, translation, voice interfaces, audio understanding | Whisper large-v3 | Qwen2.5-Omni for unified multimodal audio + text | Whisper large-v3 covers most transcription and translation needs. | AudioCraft code is MIT but model weights are CC-BY-NC 4.0. Verify license for audio generation. |
| Video generation | Short exploratory clips, motion concepts, early creative prototyping | Wan2.1 small variants | Open-Sora 2.0-style research stacks | Today, open video is a prototyping tool more than a production default. | Temporal flicker, identity drift, and long render times remain common. |
| Video editing | Interpolation, inpainting, retiming, experimental edit pipelines | RIFE, ProPainter, Wan2.1 VACE-style workflows | Custom pipelines for domain-specific video tasks | Use specialized tools rather than expecting one general model to handle everything. | Workflow complexity is high and results are sensitive to clip quality and masking. |
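Several rows above flag malformed JSON and tool misuse as the real agent bottlenecks, and both can be caught before anything executes. A minimal pre-execution check using only the standard library; the `search_orders` tool and its schema are hypothetical:

```python
import json

# Hypothetical tool schema for illustration: required argument names -> expected types.
TOOL_SCHEMAS = {
    "search_orders": {"customer_id": str, "limit": int},
}

def validate_tool_call(raw: str):
    """Reject malformed JSON and bad arguments before the tool runs."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False, "output is not valid JSON"
    if not isinstance(call, dict):
        return False, "output is not a JSON object"
    schema = TOOL_SCHEMAS.get(call.get("tool"))
    if schema is None:
        return False, f"unknown tool: {call.get('tool')!r}"
    args = call.get("arguments", {})
    for name, expected in schema.items():
        if name not in args:
            return False, f"missing argument: {name}"
        if not isinstance(args[name], expected):
            return False, f"argument {name} should be {expected.__name__}"
    return True, "ok"
```

Rejected calls can be fed back to the model for a retry; cascading failures usually start with an unvalidated call that was allowed to execute.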
The biggest change from 2024 to 2026 is not just raw model quality. It is the breadth of credible open options across text, coding, multimodal, and media generation.
These numbers are useful as a map, not as a verdict. Benchmark settings vary. Prompt formatting moves scores. Preference benchmarks can overstate real operational reliability. Use this as the first filter, then test on your own workload.
| Model | General | Reasoning | Coding | Notes |
|---|---|---|---|---|
| Qwen 3.5 397B-A17B | MMLU-Pro 87.8, SuperGPQA 70.4 | AIME26 91.3, GPQA Diamond 88.4 | LiveCodeBench v6 83.6 | Frontier MoE with only 17B active params. Native multimodal, 201 languages. Competes with Claude 4.5 Opus. |
| DeepSeek-V3.2 | Comparable to GPT-5 | IMO and IOI gold-medal level | SWE-bench competitive with GPT-5 | 685B total / 37B active MoE. Integrated thinking + tool-use. Speciale variant exceeds GPT-5 on reasoning but drops tool calling. |
| Mistral Large 3 | MMLU-Pro ~73–78 | Strong mid-to-high tier | HumanEval ~92 | 675B total / 41B active MoE. Apache 2.0. Multimodal. 256K context. |
| Llama 4 Maverick | MMLU 85.5, MMLU-Pro 80.5 | GPQA Diamond 69.8 | HumanEval 82.4 | 400B total / 17B active, 128 experts. Natively multimodal, 1M context. FP8 fits on a single H100 DGX host. |
| Llama 4 Scout | MMLU 79.6, MMLU-Pro 74.3 | GPQA Diamond 57.2 | HumanEval 74.1 | 109B total / 17B active, 16 experts. 10M context. Fits on a single H100 with on-the-fly int4 quantization. |
| gpt-oss-120b | MMLU-Pro 90.0 | AIME 2025 97.9 (with tools) | Near o4-mini on competition coding | 117B total / 5.1B active MoE. Apache 2.0. Fits on a single 80GB GPU. 128K context. |
| gpt-oss-20b | Matches o3-mini on common benchmarks | Strong for its size class | Competitive with o3-mini | 21B total / 3.6B active MoE. Apache 2.0. Runs on 16GB devices. 128K context. |
| Gemma 3 27B IT | MMLU-Pro 67.5, MMMU 64.9 | GPQA Diamond 42.4, MATH 69.0 | LiveCodeBench 29.7 | Natively multimodal. Comparable to Gemini 1.5 Pro (mixed results, not a blanket win). 128K context, 140+ languages. Chatbot Arena Elo 1338. |
| OLMo 2 13B | MMLU 81.5 | Competitive with equivalently-sized open models | Less emphasized than code-specialized families | Fully open (weights, code, data, training recipe). Apache 2.0 for base models; some instruct checkpoints have additional terms. |
Use one academic snapshot table, one real-work evaluation table, and one reliability table. If a model only looks good in one of those three, it is not production-ready for your team.
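The reliability table can be as simple as re-running a fixed prompt and counting how often the output stays well-formed and stable. A sketch, assuming a `generate` callable wraps whichever model you are testing:

```python
import json
from typing import Callable

def reliability(generate: Callable[[str], str], prompt: str, runs: int = 20) -> dict:
    """Re-run one fixed prompt to measure operational reliability,
    not just best-case quality."""
    valid_json = 0
    distinct = set()
    for _ in range(runs):
        out = generate(prompt)
        distinct.add(out)
        try:
            json.loads(out)
            valid_json += 1
        except json.JSONDecodeError:
            pass
    return {
        "json_valid_rate": valid_json / runs,
        # High output churn on a fixed prompt is a red flag for agent pipelines.
        "distinct_outputs": len(distinct),
    }
```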
The real tradeoff is not "open is better" or "closed is better." It is whether you want control, customization, and privacy enough to take on the systems burden yourself.
| Dimension | Open / open-weight | Closed ecosystem |
|---|---|---|
| Control | Self-host, fine-tune, inspect, and route however you want. | You trade that flexibility for provider-managed capability with far less systems work. |
| Cost model | Infrastructure, ops, and engineering replace per-token API pricing. | Usage-based pricing is simple but can become expensive at scale. |
| Privacy and data boundary | Best option when prompts, outputs, and logs must stay inside your environment. | Provider policy and retention controls matter more. |
| Customization | Adapters, quantization, routing, and domain tuning are the major advantages. | Prompting is easy; deep model customization is limited. |
| Operational burden | You own serving, evals, security, and reliability. | You inherit better managed infrastructure and usually better SLAs. |
| Best fit | Teams with repeatable workloads, privacy needs, or platform ambitions. | Teams optimizing for speed, simplicity, and managed frontier access. |
The fastest way to waste time in open AI is to choose models before you define the serving envelope. Pick the hardware tier first, then shortlist models that fit.
Tier 1: single consumer GPU (roughly 12–24GB VRAM)
What fits: Qwen 3.5 4B/9B, Gemma 3 12B, gpt-oss-20b, lightweight coding models, SDXL
Best for: Local testing, lightweight RAG, first agents, image generation
Watch-outs: Do not expect comfortable 70B+ serving or serious open video production.
Tier 2: workstation or single high-end GPU (roughly 24–80GB VRAM)
What fits: Qwen 3.5 27B/35B-A3B, Gemma 3 27B, gpt-oss-120b, Qwen3-Coder-Next, small video stacks
Best for: Serious private assistants, agentic coding, local experimentation with MoE models
Watch-outs: Open video is still slow, and multi-step agent stacks need careful tuning.
Tier 3: multi-GPU servers and datacenter serving
What fits: Qwen 3.5 397B-A17B, Llama 4 Maverick, Mistral Large 3, DeepSeek-V3.2, concurrency-heavy inference
Best for: Internal copilots, agent platforms, multimodal services, governed deployment
Watch-outs: This is where reliability, governance, and evaluation become more important than raw model choice.
For most real teams, the 14B-32B band is the easiest place to get strong quality without crossing into difficult multi-GPU operations. Giant MoE systems make sense later, not first.
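The tier boundaries above follow from simple arithmetic: weight memory scales with total parameters and quantization width, while active parameters mainly buy speed. A back-of-envelope fit check, with `overhead_gb` as a rough placeholder for KV cache and activations (which grow with context length):

```python
def fits_in_vram(total_params_b: float, bits_per_weight: float, vram_gb: float,
                 overhead_gb: float = 2.0) -> bool:
    """Rule-of-thumb fit check. Weights must be resident even for MoE models,
    because every expert stays in memory; only per-token compute scales with
    active params. One billion params at 8 bits is roughly 1 GB."""
    weight_gb = total_params_b * bits_per_weight / 8
    return weight_gb + overhead_gb <= vram_gb
```

Under this rule of thumb, gpt-oss-20b at 4 bits fits a 16GB device and Llama 4 Scout at int4 fits a single 80GB H100, matching the figures quoted above, while DeepSeek-V3.2 stays firmly in multi-GPU territory.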
Hallucinations are only one part of the reliability story. Open models also fail through prompt sensitivity, poor tool arguments, visual grounding errors, license misunderstandings, and brittle long-context behavior.
The most common failures are fabricated facts, false confidence, stale knowledge, malformed JSON, and plausible-but-wrong code. Code models can also generate insecure or license-sensitive output.
Expect OCR misses, object misidentification, incorrect grounding, and overconfident descriptions of partially visible content.
The main problems are prompt drift, poor typography, inconsistent identity, and weak fine-grained control unless you add editing and conditioning tools.
The biggest issues remain temporal flicker, identity drift, motion incoherence, and long runtimes for short clips.
License fit is not cleanup work after the benchmark review. It is one of the first filters. Many teams waste time evaluating models they cannot legally or economically ship.
| License pattern | Best for | Examples | Watch-out |
|---|---|---|---|
| Apache 2.0 / MIT | Commercial deployment and broad integration | OLMo 2, Whisper, Qwen 3.5, Qwen3-Coder, Mistral Large 3, gpt-oss, FLUX.1 schnell | Still verify each model card; not every family uses the same license for every checkpoint. |
| Llama 4 Community License | Commercial use with strong ecosystem momentum | Llama 4 Scout, Llama 4 Maverick | Permissive for many uses, but it is not OSI-style open source. |
| Gemma terms / custom terms | Practical use when the model fits your needs | Gemma 3 | Do not assume "Google open model" means Apache-style freedom. |
| OpenRAIL / Responsible AI licenses | Creative or research use where restrictions are acceptable | Some Stable Diffusion family releases, BigCode OpenRAIL-M | Behavioral restrictions and downstream obligations can affect productization. |
| Community / revenue-threshold licenses | Early testing before full commercialization | Some Stability releases | Revenue thresholds and enterprise terms can change the total cost of ownership. |
| Non-commercial weight licenses | Research, experimentation, internal evaluation | Some FLUX variants, AudioCraft weights | This is a hard stop for many production uses. |
Treat every checkpoint as its own legal object. Do not assume the family name tells you the full commercial story.
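Checking each checkpoint starts with its model card. A minimal sketch that reads the `license` field from a card's YAML front matter, assuming the common Hugging Face card layout; the permissive-license set here is illustrative, not legal advice:

```python
# Licenses treated as safe-by-default for this sketch; adjust to your policy.
PERMISSIVE = {"apache-2.0", "mit", "bsd-3-clause"}

def card_license(card_text: str):
    """Pull the license value out of a '---' delimited YAML front matter block."""
    in_front_matter = False
    for line in card_text.splitlines():
        if line.strip() == "---":
            if in_front_matter:
                break  # front matter ended without a license key
            in_front_matter = True
        elif in_front_matter and line.strip().startswith("license:"):
            return line.split(":", 1)[1].strip()
    return None

def needs_legal_review(card_text: str) -> bool:
    """Anything missing or outside the permissive set gets flagged."""
    lic = card_license(card_text)
    return lic is None or lic not in PERMISSIVE
```

This only automates the first pass; custom community licenses still need a human read.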
Choosing a model without choosing a serving and evaluation stack is incomplete. The stack determines latency, batching, observability, and how painful future model swaps will be.
Local runtimes such as Ollama and llama.cpp
Best for: Fastest path to local testing
Strengths: Great for laptops, desktops, and quick internal prototypes.
Limits: Not the best fit for serious multi-user production serving.
vLLM
Best for: High-throughput production inference
Strengths: PagedAttention, strong batching behavior, and a mature serving ecosystem.
Limits: More ops-heavy than local tools.
TensorRT-LLM
Best for: NVIDIA-centric optimized serving
Strengths: Best when you want GPU-specific performance tuning at scale.
Limits: More specialized setup and infra assumptions.
Hugging Face Transformers and Diffusers
Best for: Custom workflows and research flexibility
Strengths: Best ecosystem for model experimentation, adapters, and editing pipelines.
Limits: Requires more assembly than end-user desktop tools.
ComfyUI
Best for: Creative image and video workflows
Strengths: Visual pipeline building, strong community extensions, easy iteration.
Limits: Operational governance is weaker than code-first stacks.
Agent frameworks such as LangChain and LangGraph
Best for: Agents, tool use, and workflow orchestration
Strengths: Useful abstractions for state, retrieval, and multi-step execution.
Limits: They do not fix weak evals or poor model choices for you.
Use these as default launch points, not as permanent architecture decisions.
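One way to keep the serving decision reversible is to hide it behind a single interface, so application code never imports a vendor SDK directly. A minimal sketch with a stub backend; real backends would wrap a local runtime, a vLLM-class server, or a managed API:

```python
from typing import Protocol

class ChatBackend(Protocol):
    """One seam between application code and the serving stack, so swapping
    from a local runtime to a server or managed API is a one-class change."""
    def complete(self, prompt: str) -> str: ...

class EchoBackend:
    """Stub for tests; a real backend would wrap an HTTP client or runtime."""
    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"

def summarize(backend: ChatBackend, text: str) -> str:
    # Application logic depends only on the interface, never a vendor SDK.
    return backend.complete(f"Summarize: {text}")
```

The same seam is also where request logging and evaluation capture naturally live, which makes future model comparisons cheap.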
Start with Qwen 3.5 9B/27B or Llama 4 Scout, run it through a small RAG layer, and measure task completion before chasing bigger models.
Start with Qwen3-Coder-Next or Qwen3-Coder-30B-A3B, then step up only if your evals show clear gains on your real repos.
Use SDXL or FLUX for image work first. Treat open video as an R&D lane, not your default production pipeline.
Prioritize license clarity, eval discipline, and serving fit over raw leaderboard rank. vLLM-class serving paired with Qwen 3.5 27B–35B or Llama 4 Scout is usually the right first step.
Pick one workflow, one evaluation harness, one hardware target, and three candidate models. Anything broader becomes expensive research theater.
This page is built from model cards, technical reports, official repositories, standards bodies, and tooling documentation. The goal is practical decision support, not hype-driven ranking.