AI Coding Benchmarks 2026: Which Public Numbers You Can Actually Trust After Qwen3.6-Max and Kimi K2.6
The April 20 launches changed the shortlist. Qwen3.6-Max-Preview and Kimi K2.6 are now part of any serious benchmark-first comparison, but they do not publish the same kind of evidence. Kimi K2.6 ships a full public benchmark table. Qwen3.6-Max-Preview currently leads with a launch-post delta sheet and leaderboard claims rather than a full public model card. MiniMax still has the cleanest split between coding-first M2.5 and agent-first M2.7, GLM remains easiest to cite through official comparison tables, and MiMo is still better framed as a product story than a benchmark-matrix story.
- Qwen3.6-Max-Preview should be treated as a separate row from Qwen3.6-Plus, not as a renamed Plus score.
- Kimi K2.6 moves Moonshot from “good benchmark evidence” to a full public benchmark-table story with coding, agent, and vision rows.
- MiniMax M2.5 and M2.7 are still both necessary because they answer different benchmark questions.
- GLM public numbers are real, but the cleanest citation path is still a comparison table or Z.AI release note instead of one dense benchmark hub.
Which AI coding benchmark pages deserve space now?
Most benchmark roundups fail because they treat every launch post as the same kind of source. They are not. Some pages publish full tables. Some publish only deltas against a previous model. Some publish a leaderboard rank without a reusable benchmark sheet. The first job is to separate those source types before you compare the models.
As of April 21, 2026, Kimi K2.6 is one of the cleanest public benchmark pages in the Chinese model field because the English tech blog publishes a broad table spanning coding, agentic, reasoning, and vision tasks. Qwen3.6-Max-Preview is newer and more limited as a source object: the official launch post is still extremely useful, but it is best treated as a launch-post delta sheet layered on top of the older Qwen3.6-Plus release, not as a fully independent long-form model card.
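One way to keep that separation honest in your own notes is to type each source before comparing any numbers. Below is a minimal sketch in Python; the record shape and field names are hypothetical, and the scores are the public rows cited later in this article:

```python
from dataclasses import dataclass, field
from typing import Optional

# Three source shapes this article distinguishes:
#   "full_table"  - a reusable public benchmark table (e.g. the Kimi K2.6 tech blog)
#   "delta_sheet" - launch-post gains over a prior model (e.g. Qwen3.6-Max-Preview)
#   "rank_claim"  - a leaderboard position without a reusable score sheet

@dataclass
class BenchmarkSource:
    model: str
    kind: str                                    # "full_table" | "delta_sheet" | "rank_claim"
    scores: dict = field(default_factory=dict)   # absolute rows, if published
    deltas: dict = field(default_factory=dict)   # gains vs a named baseline
    baseline: Optional[str] = None               # required when kind == "delta_sheet"

kimi_k26 = BenchmarkSource(
    model="Kimi K2.6",
    kind="full_table",
    scores={"SWE-Bench Verified": 80.2, "Terminal-Bench 2.0": 66.7},
)

qwen_max_preview = BenchmarkSource(
    model="Qwen3.6-Max-Preview",
    kind="delta_sheet",
    deltas={"Terminal-Bench 2.0": +3.8, "SciCode": +10.8},
    baseline="Qwen3.6-Plus",
)

# A full-table row can go straight into a comparison table;
# a delta sheet can only be cited next to its named baseline.
for src in (kimi_k26, qwen_max_preview):
    citable = src.scores if src.kind == "full_table" else src.deltas
    print(src.model, src.kind, citable, "baseline:", src.baseline)
```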
The newest releases changed the shortlist
The newest benchmark pages matter for different reasons. Kimi K2.6 gives you a real row in a comparison table immediately. Qwen3.6-Max-Preview gives you launch-level evidence that the next Qwen tier is meaningfully above Plus on coding and tool benchmarks, but the evidence currently comes as relative gains and rank claims.
| Model | What the official source adds | What you can safely cite | What to qualify |
|---|---|---|---|
| Qwen3.6-Max-Preview | A new preview tier above Qwen3.6-Plus with a dedicated launch article | Six #1 claims plus deltas vs Plus: SkillsBench +9.9, SciCode +10.8, NL2Repo +5.0, Terminal-Bench 2.0 +3.8, SuperGPQA +2.3, ToolcallFormatIFBench +2.8 | Treat it as a launch-post delta sheet and route-specific preview, not as a full public model card |
| Kimi K2.6 | A full public benchmark table plus launch article and pricing page | SWE-Bench Verified 80.2, Terminal-Bench 2.0 66.7, SWE-Bench Pro 58.6, DeepSearchQA f1 92.5, BrowseComp (agent swarm) 86.3 | Do not mix Kimi Code membership pricing with K2.6 API pricing |
| MiniMax M2.5 | Still the cleanest coding-first benchmark page | SWE-Bench Verified 80.2, Multi-SWE-Bench 51.3, BrowseComp 76.3 | It is still not the same story as M2.7 |
| MiniMax M2.7 | Still the cleanest agent-first MiniMax benchmark page | SWE-Pro 56.22, Terminal-Bench 2.0 57.0, NL2Repo 39.8 | It is not a simple replacement for M2.5 |
| GLM-5 / GLM-5.1 | Still sourceable through official release pages and comparison tables | SWE-Bench Verified 77.8, Terminal-Bench 56.2, GLM-5.1 SWE-Bench Pro 58.4 | Keep the source owner visible when you cite a comparison-table row |
| MiMo-V2-Pro | Still strongest as a release-and-product story | Long context, agent positioning, and integration coverage | Do not force it into a same-format benchmark table if the source does not provide one |
The benchmark numbers that are easiest to cite
| Model | Best public signals | Best use in an article | What to qualify |
|---|---|---|---|
| Qwen3.6-Max-Preview | Six #1 launch claims plus benchmark deltas over Qwen3.6-Plus | A “what changed above Plus?” section or a launch-week frontier roundup | Do not write as if it has the same type of full public score sheet as Kimi K2.6 |
| Qwen 3.6 / Qwen3.6-Plus | SWE-Bench Verified 78.8, Terminal-Bench 61.6, MCPMark 48.2 | Benchmark-first and CLI-first coverage | Do not confuse the general Qwen 3.6 family with the Qwen3.6-Plus row used in the comparison table |
| Kimi K2.6 | SWE-Bench Verified 80.2, Terminal-Bench 66.7, SWE-Bench Pro 58.6, DeepSearchQA 92.5 | A full benchmark + launch story with agent, coding, and workflow evidence | Keep Kimi Code membership, API pricing, and Agent Swarm product pages separate |
| Kimi K2.5 | SWE-Bench Verified 76.8, Terminal-Bench 50.8, LiveCodeBench v6 85.0 | A balanced benchmark + product story | Kimi Code and Open Platform are separate products |
| MiniMax M2.5 | SWE-Bench Verified 80.2, Multi-SWE-Bench 51.3, BrowseComp 76.3 | Pure benchmark competitiveness | M2.5 and M2.7 tell different stories |
| MiniMax M2.7 | SWE-Pro 56.22, Terminal-Bench 57.0, NL2Repo 39.8 | Agent and terminal workflow positioning | Do not treat it as a simple M2.5 replacement |
| GLM-5 / GLM-5.1 | SWE-Bench Verified 77.8, Terminal-Bench 56.2, MCPMark 31.1 | A fair comparison row if you label the source correctly | The easiest public citation path is still Qwen’s official comparison table |
| MiMo-V2-Pro | No matching public benchmark grid; stronger release and integration signals | Product, long-context, and integration coverage | Avoid forcing it into a benchmark table that the official material does not support |
These are the public numbers most likely to survive source checks in an external article. Sources: the official Kimi K2.6 tech blog, the official MiniMax M2.5 release, the official Qwen 3.6 release, Qwen’s official comparison table, and the official Kimi K2.5 technical blog. Qwen3.6-Max-Preview is intentionally omitted here because the official launch page currently exposes deltas and #1 claims more clearly than a reusable public SWE-Bench Verified row.
This is the more useful chart when readers care about multi-step terminal work, tools, and agents. Sources: the official Kimi K2.6 tech blog, the official Qwen 3.6 release, the official MiniMax M2.7 release, Qwen’s official comparison table, and the official Kimi K2.5 technical blog. The Qwen3.6-Max-Preview launch post claims a +3.8 gain over Qwen3.6-Plus on Terminal-Bench 2.0, so it belongs in the discussion even when the chart uses older absolute Qwen rows.
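If you want to sanity-check what that delta implies, you can combine it with the absolute Qwen3.6-Plus row cited earlier. Keep in mind the result is an inference, not a published number, and it only holds if both figures come from the same Terminal-Bench 2.0 harness and configuration:

```python
# Published absolute row for Qwen3.6-Plus (Qwen's official comparison table)
qwen36_plus_terminal_bench = 61.6

# Published delta for Qwen3.6-Max-Preview vs Plus (launch post)
max_preview_delta = 3.8

# Implied absolute score - an inference, NOT a published benchmark row,
# and only comparable if both numbers use the same harness and config.
implied_max_preview = qwen36_plus_terminal_bench + max_preview_delta
print(f"Implied Terminal-Bench 2.0 for Qwen3.6-Max-Preview: {implied_max_preview:.1f}")
# -> Implied Terminal-Bench 2.0 for Qwen3.6-Max-Preview: 65.4
```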
How to use benchmark coverage without over-reading it
Benchmark pages are useful because they narrow the field quickly. They are not the final buying page. Once you know which providers have public evidence worth trusting, the next step is pricing, route design, tool support, and setup friction.
That is especially important after the April 20 launches. Qwen3.6-Max-Preview has a stronger launch delta story than a fully mature public route story. Kimi K2.6 has the opposite advantage: a mature benchmark table plus clear pricing. Those differences matter in a buyer-facing roundup.
- Use benchmark tables to shortlist providers, not to make the final purchase decision; a minimal filter sketch follows this list.
- Separate “full public score table” sources from “launch-post delta” sources.
- Use tool docs and pricing pages to decide what actually fits your workflow.
- Treat social posts as pointers back to official cards, not as primary sources.
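As a concrete version of that order of operations, a first shortlist pass might filter on evidence type before score. A minimal sketch in Python, using the public rows cited above and an arbitrary example cutoff:

```python
# (model, evidence type, SWE-Bench Verified) - public rows cited in this article
candidates = [
    ("Kimi K2.6", "full_table", 80.2),
    ("MiniMax M2.5", "full_table", 80.2),
    ("Qwen3.6-Plus", "full_table", 78.8),
    ("GLM-5", "comparison_table", 77.8),
    ("Qwen3.6-Max-Preview", "delta_sheet", None),  # no reusable absolute row yet
]

THRESHOLD = 78.0  # arbitrary example cutoff, not a recommendation

# Keep only sources with a reusable absolute score, then apply the cutoff.
# Pricing, route design, and setup checks come after this pass, not before.
shortlist = [
    (model, score)
    for model, evidence, score in candidates
    if evidence != "delta_sheet" and score >= THRESHOLD
]
print(shortlist)
# [('Kimi K2.6', 80.2), ('MiniMax M2.5', 80.2), ('Qwen3.6-Plus', 78.8)]
```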
Use benchmark pages to shortlist, then move to pricing and setup docs
This guide is the first pass. The next pass is always the provider’s pricing page, tool docs, and usage limits.
Sources and official links
Frequently asked questions
Which providers are easiest to cover in a benchmark-first article right now?
Kimi K2.6, MiniMax M2.5, MiniMax M2.7, and Qwen3.6-Plus are the cleanest if you want reusable public tables. Qwen3.6-Max-Preview is highly relevant too, but it is best cited as a launch-post delta sheet rather than as a full benchmark card.
Can GLM still be included in benchmark coverage?
Yes, but the safest public citation path is Qwen’s official comparison table, where GLM-5 benchmark rows are visible. Name that table as the source owner when you cite those rows in an external article.
Why is Qwen3.6-Max-Preview handled differently from Kimi K2.6 here?
Because the current source objects are different. Kimi K2.6 publishes a wide benchmark table in its official English tech blog. Qwen3.6-Max-Preview currently publishes benchmark deltas, #1 claims, and pricing-route evidence in the launch post and Model Studio docs. Both are useful, but they are not the same kind of citation object.
Why is MiMo not ranked in the same way here?
Because the official material is shaped differently. MiMo’s strongest public evidence is its release positioning, long context window, and integration coverage, not a public benchmark matrix that mirrors Qwen, Kimi, or MiniMax.