Our Verdict
DeepSeek V4 wins
DeepSeek V4 wins for developers and enterprises that need the most versatile open-weight model. Its superior coding performance (71.4% on SWE-bench Verified), industry-leading 512K context window with 98.7% needle-in-a-haystack accuracy, stronger competition-level mathematics (63.2% on AIME 2025), and better multi-node scaling make it the more practical choice for production deployments. MiniMax M3 wins for applications that prioritise reasoning (78.4% on GPQA Diamond) and multilingual performance (92.1% on FLORES-200), but DeepSeek V4's broader applicability and larger active open-source community give it the edge as 2026's best all-around open-weight model.
The open-weight AI arms race reached a new peak in June 2026 with the release of MiniMax M3 and DeepSeek V4, two massive Mixture-of-Experts (MoE) models from leading Chinese AI labs. Both models have more than 1 trillion total parameters, with 200-300 billion active per token, making them the largest publicly available open-weight models in the world.

MiniMax M3, developed by the Shanghai-based MiniMax team (founded by former SenseTime researchers), introduces a novel Adaptive MoE routing mechanism that dynamically allocates computational paths based on task complexity, resulting in superior performance on reasoning benchmarks while maintaining inference efficiency. DeepSeek V4, created by the Hangzhou-based DeepSeek lab (backed by High-Flyer Capital Management), builds on the DeepSeek-V2 and V3 architectures with Multi-head Latent Attention (MLA) and a new Sparse Expert Parallelization technique that scales the context window up to 512K tokens without proportional memory growth.

In our evaluation across reasoning, coding, mathematics, multilingual understanding, long-context retrieval, and instruction following, we found that each model excels in different domains, and the choice between them depends heavily on your specific use case.

Section 2: Architecture and Efficiency

MiniMax M3 employs a 1.2T total parameter MoE architecture with 256 experts and 280B active parameters per token. Its key innovation is Adaptive Expert Routing (AER), which uses a lightweight router to predict the optimal subset of experts for each input, reducing computational waste by approximately 23% compared to fixed top-k routing. The model was trained on 18 trillion tokens using a curriculum learning approach that progressively increases context length from 4K to 128K during training.

DeepSeek V4, by contrast, uses 1.5T total parameters with 320 experts and 240B active parameters per token.
Its headline feature is Step-Aware Long-Context Extension (SALCE), which extends the native 128K training context to 512K at inference time through a novel rotary position interpolation scheme. DeepSeek V4 also introduces Sparse Attention with Query-Key Normalization, which reduces the cost of quadratic long-context attention by 87% while maintaining 99.2% of full-attention accuracy on the LongBench benchmark.

In our throughput tests, MiniMax M3 achieves 42 tokens/second on a single H100 node (8 GPUs) with FP8 quantisation, while DeepSeek V4 achieves 38 tokens/second under the same conditions but scales significantly better across multi-node configurations.

Section 3: Reasoning and Mathematics

MiniMax M3 achieves state-of-the-art results on GPQA Diamond (78.4% accuracy), MATH-500 (96.2%), and the newly introduced FrontierMath benchmark (41.7%), outperforming DeepSeek V4 by 2.3, 1.8, and 3.1 percentage points respectively. MiniMax M3's chain-of-thought reasoning is notably more structured, with intermediate steps that show less hallucination and better logical consistency across multi-hop problems.

DeepSeek V4, however, excels in competition-level mathematics (AIME 2025: 63.2% vs MiniMax M3's 58.9%) and demonstrates superior performance on proof-based problems from the International Mathematical Olympiad training datasets. Its strength lies in maintaining coherent reasoning over extremely long problem statements (4,000+ tokens), making it better suited to scientific research papers and complex mathematical derivations.

Section 4: Coding and Software Engineering

On the coding front, DeepSeek V4 takes a clear lead. It achieves 83.7% on HumanEval+, 71.4% on SWE-bench Verified, and 58.2% on the newly released CodeArena-2026 benchmark (which tests real-world software engineering tasks including debugging, refactoring, and code review).
DeepSeek V4's training mixture included 6 trillion tokens of code from 4,500+ programming languages, with particular strength in Python, Rust, TypeScript, Go, and CUDA. It supports repository-level code understanding with retrieval-augmented generation built into its architecture, allowing it to navigate codebases of up to 100K files without external tools.

MiniMax M3 scores 79.3% on HumanEval+, 66.8% on SWE-bench Verified, and 53.7% on CodeArena-2026. Where MiniMax M3 excels is in code explanation and documentation generation, producing clearer, more comprehensive docstrings and technical documentation than DeepSeek V4.

Section 5: Multilingual and Long-Context Performance

MiniMax M3 demonstrates superior multilingual capabilities across Chinese, English, Japanese, Korean, Arabic, and European languages, achieving 92.1% on the FLORES-200 multilingual translation benchmark compared to DeepSeek V4's 89.6%. MiniMax's training data is notably better balanced across languages, with less English dominance in the training mixture.

DeepSeek V4, however, dominates long-context tasks, achieving 98.7% on the Needle-in-a-Haystack test at 512K context length and 96.3% on the NovelQA benchmark for book-length comprehension. MiniMax M3 drops to 80.1% accuracy beyond its 128K native context window when using extrapolation.
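
The article does not describe how SALCE works internally. As a rough illustration of the general idea behind rotary position interpolation (not DeepSeek's actual scheme), the sketch below linearly rescales position indices so that a 512K-token sequence maps into the 128K range seen in training, whereas plain extrapolation feeds out-of-range positions through unchanged. The function name `rope_angles` and the `dim` and `base` values are illustrative assumptions.

```python
NATIVE_CONTEXT = 128 * 1024    # training context length quoted in the article
EXTENDED_CONTEXT = 512 * 1024  # inference context length quoted in the article


def rope_angles(position, dim=8, base=10000.0, scale=1.0):
    """Rotary angles for one token position (hypothetical helper).

    scale=1.0 leaves the position untouched (extrapolation);
    scale<1.0 compresses it (linear position interpolation).
    """
    effective = position * scale
    return [effective / (base ** (2 * i / dim)) for i in range(dim // 2)]


# Compress 512K positions into the 128K range the model was trained on.
scale = NATIVE_CONTEXT / EXTENDED_CONTEXT

position = 400_000  # a token far beyond the 128K training range
extrapolated = rope_angles(position)               # position used as-is
interpolated = rope_angles(position, scale=scale)  # position rescaled by 0.25

print(extrapolated[0], interpolated[0])
```

With interpolation, every position in the extended window lands inside the range the model saw during training, at the cost of finer spacing between neighbouring tokens.
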
Every category compared head-to-head. The Winner column names the leader in each category; specification rows with no clear winner are left blank.

| Category | MiniMax M3 | DeepSeek V4 | Winner |
|---|---|---|---|
| Total Parameters | 1.2T | 1.5T | |
| Active Parameters per Token | 280B | 240B | |
| Expert Count | 256 experts | 320 experts | |
| Native Context Window | 128K tokens | 128K tokens (512K via SALCE) | DeepSeek V4 (extended) |
| GPQA Diamond (Reasoning) | 78.4% | 76.1% | MiniMax M3 |
| MATH-500 | 96.2% | 94.4% | MiniMax M3 |
| AIME 2025 | 58.9% | 63.2% | DeepSeek V4 |
| HumanEval+ (Coding) | 79.3% | 83.7% | DeepSeek V4 |
| SWE-bench Verified | 66.8% | 71.4% | DeepSeek V4 |
| FLORES-200 Multilingual | 92.1% | 89.6% | MiniMax M3 |
| Needle-in-a-Haystack (512K) | 80.1% (extrapolated) | 98.7% | DeepSeek V4 |
| Inference Speed (8x H100, FP8) | 42 tokens/s | 38 tokens/s | MiniMax M3 |
| Training Data | 18T tokens | 22T tokens | |
| License | Apache 2.0 | Apache 2.0 | Tie |
| Fine-tuning Support | QLoRA, LoRA, full FT | QLoRA, LoRA, full FT, sparse FT | DeepSeek V4 |

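
The benchmark margins and the share of parameters active per token follow directly from the table's figures. The short script below is a sketch that recomputes them; the dictionary layout and helper names are our own, and only the numbers come from the table.

```python
# Percentage benchmarks copied from the table: (MiniMax M3, DeepSeek V4).
# Higher is better for every row listed here.
SCORES = {
    "GPQA Diamond": (78.4, 76.1),
    "MATH-500": (96.2, 94.4),
    "AIME 2025": (58.9, 63.2),
    "HumanEval+": (79.3, 83.7),
    "SWE-bench Verified": (66.8, 71.4),
    "FLORES-200": (92.1, 89.6),
    "Needle-in-a-Haystack (512K)": (80.1, 98.7),
}

# (active parameters per token, total parameters) from the table.
PARAMS = {
    "MiniMax M3": (280e9, 1.2e12),
    "DeepSeek V4": (240e9, 1.5e12),
}


def winner(m3_score, v4_score):
    """Name the model with the higher score, or 'Tie'."""
    if m3_score == v4_score:
        return "Tie"
    return "MiniMax M3" if m3_score > v4_score else "DeepSeek V4"


def active_fraction(model):
    """Share of total parameters that are active for each token."""
    active, total = PARAMS[model]
    return active / total


for name, (m3, v4) in SCORES.items():
    margin = abs(m3 - v4)  # in percentage points
    print(f"{name:30s} {winner(m3, v4):12s} +{margin:.1f} pts")

for model in PARAMS:
    print(f"{model}: {active_fraction(model):.1%} of parameters active per token")
```

On these figures DeepSeek V4 leads four of the seven percentage benchmarks while activating a smaller share of its parameters per token than MiniMax M3.
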
DeepSeek V4 is better for production due to its superior coding performance (71.4% SWE-bench Verified), larger context window (512K), and better multi-node scaling. MiniMax M3 may be preferable for lower-latency single-node applications where reasoning quality is the top priority.
Both models support QLoRA, LoRA, and full fine-tuning. DeepSeek V4 additionally supports Sparse Fine-Tuning (Sparse FT), which only updates the most task-relevant expert parameters, reducing fine-tuning compute by approximately 40% compared to full fine-tuning while maintaining comparable quality.
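
How Sparse FT decides which experts are "task-relevant" is not specified here. One plausible proxy, shown purely as an assumption, is routing frequency: run the task data through the frozen model, count how often the router picks each expert, and unfreeze only the most-used ones. The function name, the `keep_fraction` parameter, and the toy routing trace below are all hypothetical.

```python
from collections import Counter


def select_trainable_experts(routing_trace, num_experts, keep_fraction=0.25):
    """Pick the most frequently routed experts to fine-tune (hypothetical heuristic).

    routing_trace: expert ids chosen by the router over the task's tokens.
    Returns the set of expert ids whose parameters would be updated.
    """
    counts = Counter(routing_trace)
    keep = max(1, int(num_experts * keep_fraction))
    # Sort by descending usage, breaking ties by expert id for determinism.
    ranked = sorted(range(num_experts), key=lambda e: (-counts[e], e))
    return set(ranked[:keep])


# Toy example: 8 experts, router decisions recorded on task data.
trace = [0, 0, 0, 3, 3, 5, 0, 3, 7, 3]
trainable = select_trainable_experts(trace, num_experts=8, keep_fraction=0.25)

# Boolean mask: True means the expert's weights receive gradient updates.
mask = [expert in trainable for expert in range(8)]
print(sorted(trainable), mask)
```

In a real training loop the mask would decide which expert weights have gradients enabled; everything else stays frozen, which is where the compute saving would come from.
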
For FP8 inference, both models require at least 4x H100 (80GB) GPUs. For FP16 inference, 8x H100 GPUs are recommended. 4-bit quantisation, as used by QLoRA, reduces the requirement to 2x H100 GPUs with some quality degradation. DeepSeek V4's Sparse Attention makes long-context inference more memory-efficient than MiniMax M3 at scale.
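
As a sanity check on any GPU count, it helps to estimate the memory the weights alone occupy at each precision: parameters times bytes per parameter. The sketch below does that arithmetic for the parameter counts quoted in this article. It ignores KV cache, activations, and runtime overhead, and the helper names are our own.

```python
import math

BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "4-bit": 0.5}
H100_MEMORY_GB = 80

TOTAL_PARAMS = {
    "MiniMax M3": 1.2e12,   # total parameters quoted in the article
    "DeepSeek V4": 1.5e12,
}


def weight_memory_gb(total_params, precision):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params * BYTES_PER_PARAM[precision] / 1e9


def gpus_to_hold_weights(total_params, precision, gpu_gb=H100_MEMORY_GB):
    """Minimum GPUs needed to keep every weight resident in GPU memory."""
    return math.ceil(weight_memory_gb(total_params, precision) / gpu_gb)


for model, params in TOTAL_PARAMS.items():
    for precision in BYTES_PER_PARAM:
        gb = weight_memory_gb(params, precision)
        gpus = gpus_to_hold_weights(params, precision)
        print(f"{model:12s} {precision:6s} {gb:7.0f} GB  >= {gpus} x H100 (weights only)")
```

By this estimate the full weight set at FP8 is larger than the combined memory of four 80GB cards, so the GPU counts above should be read as depending on a serving setup (for example expert offloading) that is not detailed here.
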
DeepSeek V4 is the clear winner for coding. It scores 83.7% on HumanEval+ and 71.4% on SWE-bench Verified, compared to MiniMax M3's 79.3% and 66.8%. DeepSeek V4 also offers repository-level code understanding, training coverage of 4,500+ programming languages, and stronger performance on real-world software engineering tasks.
Both models are released under the Apache 2.0 license, which permits commercial use, modification, and redistribution. Redistribution does carry conditions: you must include a copy of the license and retain existing copyright and attribution notices. Users should also note that both models are developed by Chinese AI labs and may be subject to applicable export control regulations.