Model Guide8 min readReviewed Apr 20, 2026

StepFun Step 3.5 Flash: 196B-Parameter Agent-First MoE Model (350 TPS)

StepFun Step 3.5 Flash, released in February 2026, is an agent-first Mixture-of-Experts model with 196.81 billion total parameters and only 11 billion active per token. It uses 288 routed experts per layer with top-8 activation, Multi-Token Prediction (MTP-3) that predicts 4 tokens per forward pass, and Sliding Window Attention with a 3:1 ratio. The model achieves up to 350 tokens per second inference speed and scores 74.4% on SWE-bench Verified and 97.3% on AIME 2025. It is released under the Apache 2.0 license.

Published Apr 19, 2026Updated Apr 20, 2026

196.81B total / 11B active MoE with 288 routed experts per layer and top-8 activation.
Multi-Token Prediction (MTP-3) predicts 4 tokens per forward pass for up to 350 TPS.
Scores 74.4% on SWE-bench Verified, 97.3% on AIME 2025, and 86.4% on LiveCodeBench-V6.
Apache 2.0 license with dramatically lower decoding cost than comparable models.

Quick note: This guide is based on public docs and release pages, but you should still verify current pricing, limits, supported tools, and region-specific billing on the official source before you pay, subscribe, or integrate.

Architecture: agent-first MoE with Multi-Token Prediction

StepFun Step 3.5 Flash is designed around a 45-layer Mixture-of-Experts architecture with a hidden dimension of 4096. Each layer contains 288 routed experts plus 1 shared expert, with top-8 activation selecting only 11 billion of the 196.81 billion total parameters per token. This sparse activation is what enables the model to achieve high throughput without massive compute requirements.

A key innovation is Multi-Token Prediction (MTP-3), which predicts 4 tokens per forward pass instead of the standard 1. Combined with Sliding Window Attention at a 3:1 ratio and a vocabulary of 128,896 tokens, the model reaches inference speeds of up to 350 tokens per second. The 256K context window supports long-form agent interactions and complex coding sessions.

Step 3.5 Flash official snapshot infographic — The official Step 3.5 Flash repo, pricing docs, and StepPlan overview make the speed-versus-cost story much easier to explain than generic benchmark slides. Source: Official Step 3.5 Flash GitHub repository.

Official Step 3.5 Flash GitHub page screenshot

Official screenshot

Step 3.5 Flash exposes architecture and benchmark rows directly on the official repo page

For public writing, the GitHub page is one of the best official assets because it makes the MoE architecture, benchmark table, and open-weight route visible without guesswork.

Strong image for explaining why Step 3.5 Flash is framed around speed and decoding efficiency.
Useful when readers want to verify that the model is an Apache 2.0 open release.

Source: Official Step 3.5 Flash GitHub repository.

StepFun Step 3.5 Flash architecture specifications
Parameter	Value
Total parameters	196.81B
Active parameters	11B
Architecture type	Mixture-of-Experts (MoE)
Routed experts per layer	288
Shared experts per layer	1
Experts activated per token	8 (top-8) + 1 shared
Layers	45
Hidden dimension	4096
Context window	256K tokens
Vocabulary size	128,896
Multi-Token Prediction	MTP-3 (4 tokens per forward pass)
Attention mechanism	Sliding Window Attention (3:1 ratio)
Max inference speed	Up to 350 TPS
License	Apache 2.0

Benchmarks: competitive at a fraction of the cost

Step 3.5 Flash delivers benchmark results that compete with much larger models. It scores 74.4% on SWE-bench Verified, 51.0% on Terminal-Bench 2.0, and 97.3% on AIME 2025 — the latter placing it among the top publicly reported scores on that benchmark. On LiveCodeBench-V6 it reaches 86.4%, and on GAIA (no file) it scores 84.5.

The BrowseComp score is 51.6 without context and 69.0 with context, showing that the model benefits significantly from augmented retrieval. The tau2-Bench score of 88.2 indicates strong tool-use and agent capabilities.

Step 3.5 Flash benchmarks vs competitors
Benchmark	Step 3.5 Flash	DeepSeek V3.2	Kimi K2.5	GLM-4.7	MiniMax M2.1	MiMo-V2 Flash
SWE-bench Verified	74.4%	—	76.8%	—	—	—
Terminal-Bench 2.0	51.0%	—	50.8%	—	—	—
tau2-Bench	88.2	—	—	—	—	—
BrowseComp	51.6 (69.0 w/ ctx)	—	—	—	—	—
AIME 2025	97.3%	—	96.1%	—	—	—
LiveCodeBench-V6	86.4%	—	85.0%	—	—	—
GAIA (no file)	84.5	—	—	—	—	—

Decoding cost advantage

One of the most compelling aspects of Step 3.5 Flash is its decoding efficiency. Because only 11 billion of 196.81 billion parameters are active per token, and Multi-Token Prediction generates 4 tokens per forward pass, the model achieves a dramatically lower decoding cost compared to models with similar capability.

Compared to DeepSeek V3.2, Step 3.5 Flash is approximately 6 times cheaper to decode. Compared to Kimi K2, it is approximately 18.9 times cheaper. This cost advantage makes it particularly attractive for high-volume production deployments where inference cost is a primary concern.

Relative decoding cost comparison (lower is better)
Model	Relative decoding cost
Step 3.5 Flash	1.0x (baseline)
DeepSeek V3.2	6.0x
Kimi K2	18.9x

Access routes: PAYG API pricing and Step Plan subscriptions

StepFun now publishes two distinct official buying routes for Step 3.5 Flash. The direct API route appears on the public pricing page, where `step-3.5-flash` is listed at 0.7 CNY per 1M uncached input tokens, 0.14 CNY cached input tokens, and 2.1 CNY output tokens. That is the cleanest public token-billing row for programmatic access.

The Step Plan route is different. The official Step Plan overview sells Step 3.5 Flash through subscription bundles for tool-first usage, with Flash Mini at ¥49/month, Flash Plus at ¥99/month, Flash Pro at ¥199/month, and Flash Max at ¥699/month. The same page also documents a dedicated Step Plan base URL and positions the plan around OpenClaw, Claude Code, Trae, Cursor, and related tools rather than raw PAYG token billing.

Step Plan uses a dedicated base URL: `https://api.stepfun.com/step_plan/v1`.
The Step Plan overview currently lists `step-3.5-flash-2603`, `step-3.5-flash`, and `stepaudio-2.5-tts` in the supported-model lineup.
The StepFun pricing page also lists StepAudio 2.5 TTS at ¥5.8 per 10,000 characters, which helps show the broader product surface without mixing TTS pricing into the Step 3.5 Flash token row.
For buyer-facing writing, keep the API PAYG route and Step Plan subscription route separate. They are not the same billing surface.

Official StepFun access routes for Step 3.5 Flash
Route	Public pricing or plan	Best fit	Notes
Direct API	¥0.7 input / ¥0.14 cached input / ¥2.1 output per 1M tokens	Builders who want PAYG API billing	Published on the StepFun pricing details page under `step-3.5-flash`.
Step Plan Flash Mini	¥49 / month	Light tool-first usage	Subscription route with 5-hour and weekly prompt quotas.
Step Plan Flash Plus	¥99 / month	Frequent coding and agent work	Same Step Plan route, higher quota tier.
Step Plan Flash Pro	¥199 / month	Heavy individual use	Same Step Plan route, larger quota tier.
Step Plan Flash Max	¥699 / month	Team or production-style usage	Highest public Step Plan tier.

BuyGLM shows package prices in USD. When a source page is published in CNY, the displayed value uses a fixed 1 USD = 8 CNY conversion and should still be checked against the live vendor page before payment.

Use Step 3.5 Flash when inference cost and speed matter as much as accuracy

The Apache 2.0 weights are the fastest path for experimentation. The API is the fastest path for production deployments that need high-throughput agentic workflows at low cost.

Open Step 3.5 Flash on GitHub Submit request

Sources and official links

Frequently asked questions

What makes Step 3.5 Flash different from other MoE models?

Step 3.5 Flash has an unusually low active-parameter ratio (11B out of 196.81B) and uses Multi-Token Prediction (MTP-3) to generate 4 tokens per forward pass. This combination yields up to 350 TPS and dramatically lower decoding costs compared to peers.

Is Step 3.5 Flash open source?

Yes. Step 3.5 Flash is released under the Apache 2.0 license. Weights are available on GitHub and Hugging Face.

How does Step 3.5 Flash compare on coding benchmarks?

It scores 74.4% on SWE-bench Verified and 51.0% on Terminal-Bench 2.0. While slightly below the top scorers on SWE-bench (e.g., K2.5 at 76.8%), it delivers competitive results at a fraction of the inference cost.

Architecture: agent-first MoE with Multi-Token Prediction

Step 3.5 Flash exposes architecture and benchmark rows directly on the official repo page

Benchmarks: competitive at a fraction of the cost

Decoding cost advantage

Access routes: PAYG API pricing and Step Plan subscriptions

Use Step 3.5 Flash when inference cost and speed matter as much as accuracy

Sources and official links

Frequently asked questions

Related guides

AI Coding Benchmarks 2026: Which Public Numbers You Can Actually Trust After Qwen3.6-Max and Kimi K2.6

Best AI Coding Plan in 2026? Choose by Route, Tool, and Budget

Kimi K2 and K2.5: Moonshot AI's Open-Source MoE Models (1T Parameters, 32B Active)