| Writing / general LLM | Chat, drafting, summarization, RAG, internal copilots | Qwen3.6-35B-A3B or Llama 4 Scout | Qwen3.5-397B-A17B, Llama 4 Maverick, Mistral Large 3, DeepSeek-V3.2 | Qwen3.6-35B-A3B or gpt-oss-20b covers most team needs without datacenter overhead. | License terms (e.g., Llama 4's EU restrictions on multimodal use) and quantization quality matter more than leaderboard hype. |
| Coding | PR assistance, code generation, refactors, test writing, coding agents | Qwen3-Coder-Next | Qwen3-Coder-480B-A35B, Mistral Large 3, or DeepSeek-V3.2 | Qwen3-Coder-Next (80B total / 3B active, 256K context) is the most practical for real developer use. | Do not deploy without tests, sandboxing, and dependency/security review. |
| Agents | Tool use, workflow automation, multi-step task execution | Qwen3.6-35B-A3B or Llama 4 Scout | Llama 4 Maverick, Mistral Large 3, or DeepSeek-V3.2 | Smaller models plus strong orchestration often beat oversized models with weak tooling. | JSON breakage, tool misuse, and cascading failures are the real bottlenecks. Avoid DeepSeek V3.2-Speciale here — it drops tool calling. |
| Multimodal / VLM | Document understanding, image Q&A, visual agents, OCR-heavy workflows | Gemma 4 (E4B or 26B MoE) or Qwen3.6-35B-A3B | Qwen3.5-397B-A17B or Llama 4 Maverick | Gemma 4 26B MoE is the most practical local starting point with up to 256K context. | Grounding mistakes and OCR hallucinations still require checks; for true visual grounding, prefer Molmo 2. |
| Omnimodal (text + vision + speech) | Voice-first assistants, real-time audio-visual reasoning, streaming speech generation | MiniCPM-o 4.5 (9B) for local real-time | Qwen3.5-Omni for full audio-visual reasoning at datacenter scale | MiniCPM-o 4.5 gives most teams a real-time omni model that fits on prosumer hardware. | Some serving stacks still need patched support; streaming speech latency is the binding constraint. |
| Video grounding / pointing | Video understanding, object pointing, tracking, multi-image reasoning | Molmo 2 (4B, 8B, O-7B variants) | Custom Molmo 2 fine-tunes on domain data | The only open family in 2026 shipping true open-data video grounding without distillation from proprietary VLMs. | Not a general chat model — pair with a chat-capable LLM for conversational interfaces. |
| Edge multimodal | Phones, IoT, lightweight servers, on-device VLM workloads | MiniCPM-V 4.6 (1.3B) or Gemma 4 E2B | MiniCPM-o 4.5 or Gemma 4 E4B when more capability is required | MiniCPM-V 4.6 punches above its weight for visual tasks on 4–8GB devices. | Capability ceiling is real — do not expect 27B-class reasoning at 1.3B. |
| Image generation | Concept art, marketing assets, ideation, product visuals | SDXL | FLUX.1-schnell (Apache 2.0) or other FLUX variants when licensing permits | SDXL remains the safest default for ecosystem compatibility; reach for FLUX.1-schnell when you need speed and a permissive license. | Typography and exact prompt fidelity still need workflow iteration. FLUX licensing is checkpoint-specific. |
| Image editing | Inpainting, control, masking, pose/depth guidance, product edits | ControlNet + SAM 2 + LaMa + Diffusers | Project-specific editing stacks with custom masks and pipelines | Editing quality comes from stack design, not one magic checkpoint. | Commercial rights differ across base checkpoints and extensions. |
| Speech recognition | Transcription, translation, voice interfaces, audio understanding | Whisper large-v3 | Whisper turbo for faster streaming, or Qwen3.5-Omni for unified audio + text reasoning | Whisper large-v3 covers most transcription and translation needs; turbo is the official low-latency derivative. | AudioCraft code is MIT but its model weights are CC-BY-NC 4.0 — not usable for most commercial audio generation. |
| Speech synthesis (TTS) | Voice agents, narration, dubbing, expressive synthesis | Chatterbox-Turbo (350M) or Chatterbox-Multilingual (500M) | Domain-tuned Chatterbox variants | The cleanest open TTS row in 2026, with modern voice cloning and expressive controls. | Voice cloning has ethical and legal implications. Confirm consent rules before deploying. |
| Video generation | Short exploratory clips, motion concepts, early creative prototyping | Wan2.1 small variants | Wan2.1 family extensions (FLF2V, VACE) or Open-Sora 2.0 (11B, Apache 2.0) | Today, open video is a prototyping tool more than a production default. | Temporal flicker, identity drift, and long render times remain common. Verify the repo LICENSE before commercial use. |
| Video editing | Interpolation, inpainting, retiming, experimental edit pipelines | RIFE, ProPainter, Wan2.1 VACE-style workflows | Custom pipelines for domain-specific video tasks | Use specialized tools rather than expecting one general model to handle everything. | Workflow complexity is high; results are sensitive to clip quality and masking. Some tools (e.g., ProPainter) ship under research-only S-Lab terms — verify license fit. |
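The agents row above flags JSON breakage and tool misuse as the real bottlenecks, and those are handled in orchestration code rather than by picking a bigger model. Here is a minimal sketch of a pre-execution tool-call guard; the tool registry, tool names, and `validate_tool_call` function are illustrative, not part of any specific framework:

```python
import json

# Hypothetical tool registry: maps each tool name to its allowed argument names.
TOOLS = {
    "search_docs": {"query"},
    "create_ticket": {"title", "body"},
}

def validate_tool_call(raw: str) -> dict:
    """Parse and sanity-check a model-emitted tool call before executing it.

    Raises ValueError on malformed JSON, unknown tools, or unexpected
    arguments -- the failure modes that cascade in multi-step agents.
    """
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"malformed JSON from model: {exc}") from exc
    name = call.get("name")
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name!r}")
    args = call.get("arguments", {})
    extra = set(args) - TOOLS[name]
    if extra:
        raise ValueError(f"unexpected arguments for {name}: {sorted(extra)}")
    return call

call = validate_tool_call(
    '{"name": "search_docs", "arguments": {"query": "quantization"}}'
)
print(call["name"])  # search_docs
```

Rejecting a bad call and retrying with the error message fed back to the model is usually cheaper than letting a half-parsed call reach a real tool; this is why the table argues that smaller models plus strong orchestration often beat oversized models with weak tooling.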