AI Coding Benchmarks 2026: Which Public Numbers You Can Actually Trust After Qwen3.6-Max and Kimi K2.6
The April 20 launches changed the shortlist. Qwen3.6-Max-Preview and Kimi K2.6 are now part of any serious benchmark-first comparison, but they do not publish the same kind of evidence. Kimi K2.6 ships a full public benchmark table. Qwen3.6-Max-Preview currently leads with a launch-post delta sheet and leaderboard claims rather than a full public model card. MiniMax still has the cleanest split between coding-first M2.5 and agent-first M2.7, GLM remains easiest to cite through official comparison tables, and MiMo is still better framed as a product story than a benchmark-matrix story.
- Qwen3.6-Max-Preview should be treated as a separate row from Qwen3.6-Plus, not as a renamed Plus score.
- Kimi K2.6 moves Moonshot from “good benchmark evidence” to a full public benchmark-table story with coding, agent, and vision rows.
- MiniMax M2.5 and M2.7 are still both necessary because they answer different benchmark questions.
- GLM public numbers are real, but the cleanest citation path is still a comparison table or Z.AI release note instead of one dense benchmark hub.
Which AI coding benchmark pages deserve space now?
Most benchmark roundups fail because they treat every launch post as the same kind of source. They are not. Some pages publish full tables. Some publish only deltas against a previous model. Some publish a leaderboard rank without a reusable benchmark sheet. The first job is to separate those source types before you compare the models.
As of April 21, 2026, Kimi K2.6 is one of the cleanest public benchmark pages in the Chinese model field because the English tech blog publishes a broad table spanning coding, agentic, reasoning, and vision tasks. Qwen3.6-Max-Preview is newer and more limited as a source object: the official launch post is still extremely useful, but it is best treated as a launch-post delta sheet layered on top of the older Qwen3.6-Plus release, not as a fully independent long-form model card.
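One way to keep that separation honest in your own notes is to type each source before comparing any numbers. Below is a minimal sketch in Python; the record shape and field names are hypothetical, and the scores are the public rows cited later in this article:

```python
from dataclasses import dataclass, field
from typing import Optional

# Three source shapes this article distinguishes:
#   "full_table"  - a reusable public benchmark table (e.g. the Kimi K2.6 tech blog)
#   "delta_sheet" - launch-post gains over a prior model (e.g. Qwen3.6-Max-Preview)
#   "rank_claim"  - a leaderboard position without a reusable score sheet

@dataclass
class BenchmarkSource:
    model: str
    kind: str                                    # "full_table" | "delta_sheet" | "rank_claim"
    scores: dict = field(default_factory=dict)   # absolute rows, if published
    deltas: dict = field(default_factory=dict)   # gains vs a named baseline
    baseline: Optional[str] = None               # required when kind == "delta_sheet"

kimi_k26 = BenchmarkSource(
    model="Kimi K2.6",
    kind="full_table",
    scores={"SWE-Bench Verified": 80.2, "Terminal-Bench 2.0": 66.7},
)

qwen_max_preview = BenchmarkSource(
    model="Qwen3.6-Max-Preview",
    kind="delta_sheet",
    deltas={"Terminal-Bench 2.0": +3.8, "SciCode": +10.8},
    baseline="Qwen3.6-Plus",
)

# A full-table row can go straight into a comparison table;
# a delta sheet can only be cited next to its named baseline.
for src in (kimi_k26, qwen_max_preview):
    citable = src.scores if src.kind == "full_table" else src.deltas
    print(src.model, src.kind, citable, "baseline:", src.baseline)
```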
The newest releases changed the shortlist
The newest benchmark pages matter for different reasons. Kimi K2.6 gives you a real row in a comparison table immediately. Qwen3.6-Max-Preview gives you launch-level evidence that the next Qwen tier is meaningfully above Plus on coding and tool benchmarks, but the evidence currently comes as relative gains and rank claims.
| Model | What the official source adds | What you can safely cite | What to qualify |
|---|---|---|---|
| Qwen3.6-Max-Preview | A new preview tier above Qwen3.6-Plus with a dedicated launch article | Six #1 claims plus deltas vs Plus: SkillsBench +9.9, SciCode +10.8, NL2Repo +5.0, Terminal-Bench 2.0 +3.8, SuperGPQA +2.3, ToolcallFormatIFBench +2.8 | Treat it as a launch-post delta sheet and route-specific preview, not as a full public model card |
| Kimi K2.6 | A full public benchmark table plus launch article and pricing page | SWE-Bench Verified 80.2, Terminal-Bench 2.0 66.7, SWE-Bench Pro 58.6, DeepSearchQA f1 92.5, BrowseComp (agent swarm) 86.3 | Do not mix Kimi Code membership pricing with K2.6 API pricing |
| MiniMax M2.5 | Still the cleanest coding-first benchmark page | SWE-Bench Verified 80.2, Multi-SWE-Bench 51.3, BrowseComp 76.3 | It is still not the same story as M2.7 |
| MiniMax M2.7 | Still the cleanest agent-first MiniMax benchmark page | SWE-Pro 56.22, Terminal-Bench 2.0 57.0, NL2Repo 39.8 | It is not a simple replacement for M2.5 |
| GLM-5 / GLM-5.1 | Still sourceable through official release pages and comparison tables | SWE-Bench Verified 77.8, Terminal-Bench 56.2, GLM-5.1 SWE-Bench Pro 58.4 | Keep the source owner visible when you cite a comparison-table row |
| MiMo-V2-Pro | Still strongest as a release-and-product story | Long context, agent positioning, and integration coverage | Do not force it into a same-format benchmark table if the source does not provide one |
The benchmark numbers that are easiest to cite
| Model | Best public signals | Best use in an article | What to qualify |
|---|---|---|---|
| Qwen3.6-Max-Preview | Six #1 launch claims plus benchmark deltas over Qwen3.6-Plus | A “what changed above Plus?” section or a launch-week frontier roundup | Do not write as if it has the same type of full public score sheet as Kimi K2.6 |
| Qwen 3.6 / Qwen3.6-Plus | SWE-Bench Verified 78.8, Terminal-Bench 61.6, MCPMark 48.2 | Benchmark-first and CLI-first coverage | Do not confuse the general Qwen 3.6 family with the Qwen3.6-Plus row used in the comparison table |
| Kimi K2.6 | SWE-Bench Verified 80.2, Terminal-Bench 66.7, SWE-Bench Pro 58.6, DeepSearchQA 92.5 | A full benchmark + launch story with agent, coding, and workflow evidence | Keep Kimi Code membership, API pricing, and Agent Swarm product pages separate |
| Kimi K2.5 | SWE-Bench Verified 76.8, Terminal-Bench 50.8, LiveCodeBench v6 85.0 | A balanced benchmark + product story | Kimi Code and Open Platform are separate products |
| MiniMax M2.5 | SWE-Bench Verified 80.2, Multi-SWE-Bench 51.3, BrowseComp 76.3 | Pure benchmark competitiveness | M2.5 and M2.7 tell different stories |
| MiniMax M2.7 | SWE-Pro 56.22, Terminal-Bench 57.0, NL2Repo 39.8 | Agent and terminal workflow positioning | Do not treat it as a simple M2.5 replacement |
| GLM-5 / GLM-5.1 | SWE-Bench Verified 77.8, Terminal-Bench 56.2, MCPMark 31.1 | A fair comparison row if you label the source correctly | The easiest public citation path is still Qwen’s official comparison table |
| MiMo-V2-Pro | No matching public benchmark grid; stronger release and integration signals | Product, long-context, and integration coverage | Avoid forcing it into a benchmark table that the official material does not support |
These are the public numbers most likely to survive source checks in an external article. Sources: the official Kimi K2.6 tech blog, the official MiniMax M2.5 release, the official Qwen 3.6 release, Qwen’s official comparison table, and the official Kimi K2.5 technical blog. Qwen3.6-Max-Preview is intentionally omitted here because the official launch page currently exposes deltas and #1 claims more clearly than a reusable public SWE-Bench Verified row.
This is the more useful chart when readers care about multi-step terminal work, tools, and agents. Sources: the official Kimi K2.6 tech blog, the official Qwen 3.6 release, the official MiniMax M2.7 release, Qwen’s official comparison table, and the official Kimi K2.5 technical blog. The Qwen3.6-Max-Preview launch post claims a +3.8 gain over Qwen3.6-Plus on Terminal-Bench 2.0, so it belongs in the discussion even when the chart uses older absolute Qwen rows.
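If you want to sanity-check what that delta implies, you can combine it with the absolute Qwen3.6-Plus row cited earlier. Keep in mind the result is an inference, not a published number, and it only holds if both figures come from the same Terminal-Bench 2.0 harness and configuration:

```python
# Published absolute row for Qwen3.6-Plus (Qwen's official comparison table)
qwen36_plus_terminal_bench = 61.6

# Published delta for Qwen3.6-Max-Preview vs Plus (launch post)
max_preview_delta = 3.8

# Implied absolute score - an inference, NOT a published benchmark row,
# and only comparable if both numbers use the same harness and config.
implied_max_preview = qwen36_plus_terminal_bench + max_preview_delta
print(f"Implied Terminal-Bench 2.0 for Qwen3.6-Max-Preview: {implied_max_preview:.1f}")
# -> Implied Terminal-Bench 2.0 for Qwen3.6-Max-Preview: 65.4
```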
How to use benchmark coverage without over-reading it
Benchmark pages are useful because they narrow the field quickly. They are not the final buying page. Once you know which providers have public evidence worth trusting, the next step is pricing, route design, tool support, and setup friction.
That is especially important after the April 20 launches. Qwen3.6-Max-Preview has a stronger launch delta story than a fully mature public route story. Kimi K2.6 has the opposite advantage: a mature benchmark table plus clear pricing. Those differences matter in a buyer-facing roundup.
- Use benchmark tables to shortlist providers, not to make the final purchase decision; a minimal filter sketch follows this list.
- Separate “full public score table” sources from “launch-post delta” sources.
- Use tool docs and pricing pages to decide what actually fits your workflow.
- Treat social posts as pointers back to official cards, not as primary sources.
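As a concrete version of that order of operations, a first shortlist pass might filter on evidence type before score. A minimal sketch in Python, using the public rows cited above and an arbitrary example cutoff:

```python
# (model, evidence type, SWE-Bench Verified) - public rows cited in this article
candidates = [
    ("Kimi K2.6", "full_table", 80.2),
    ("MiniMax M2.5", "full_table", 80.2),
    ("Qwen3.6-Plus", "full_table", 78.8),
    ("GLM-5", "comparison_table", 77.8),
    ("Qwen3.6-Max-Preview", "delta_sheet", None),  # no reusable absolute row yet
]

THRESHOLD = 78.0  # arbitrary example cutoff, not a recommendation

# Keep only sources with a reusable absolute score, then apply the cutoff.
# Pricing, route design, and setup checks come after this pass, not before.
shortlist = [
    (model, score)
    for model, evidence, score in candidates
    if evidence != "delta_sheet" and score >= THRESHOLD
]
print(shortlist)
# [('Kimi K2.6', 80.2), ('MiniMax M2.5', 80.2), ('Qwen3.6-Plus', 78.8)]
```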
Use benchmark pages to shortlist, then move to pricing and setup docs
This guide is the first pass. The next pass is always the provider’s pricing page, tool docs, and usage limits.
Sources and official links
Frequently asked questions
Which providers are easiest to cover in a benchmark-first article right now?
Kimi K2.6, MiniMax M2.5, MiniMax M2.7, and Qwen3.6-Plus are the cleanest if you want reusable public tables. Qwen3.6-Max-Preview is highly relevant too, but it is best cited as a launch-post delta sheet rather than as a full benchmark card.
Can GLM still be included in benchmark coverage?
Yes, but the safest public citation path is Qwen’s official comparison table, where GLM-5 benchmark rows are visible. Name that table as the source owner when you cite those rows in an external article.
Why is Qwen3.6-Max-Preview handled differently from Kimi K2.6 here?
Because the current source objects are different. Kimi K2.6 publishes a wide benchmark table in its official English tech blog. Qwen3.6-Max-Preview currently publishes benchmark deltas, #1 claims, and pricing-route evidence in the launch post and Model Studio docs. Both are useful, but they are not the same kind of citation object.
Why is MiMo not ranked in the same way here?
Because the official material is shaped differently. MiMo’s strongest public evidence is its release positioning, long context window, and integration coverage, not a public benchmark matrix that mirrors Qwen, Kimi, or MiniMax.