Model Guide8 min readReviewed Apr 20, 2026

StepFun Step 3.5 Flash: 196B-Parameter Agent-First MoE Model (350 TPS)

StepFun Step 3.5 Flash, released in February 2026, is an agent-first Mixture-of-Experts model with 196.81 billion total parameters and only 11 billion active per token. It uses 288 routed experts per layer with top-8 activation, Multi-Token Prediction (MTP-3) that predicts 4 tokens per forward pass, and Sliding Window Attention with a 3:1 ratio. The model achieves up to 350 tokens per second inference speed and scores 74.4% on SWE-bench Verified and 97.3% on AIME 2025. It is released under the Apache 2.0 license.

Published Apr 19, 2026Updated Apr 20, 2026
  • 196.81B total / 11B active MoE with 288 routed experts per layer and top-8 activation.
  • Multi-Token Prediction (MTP-3) predicts 4 tokens per forward pass for up to 350 TPS.
  • Scores 74.4% on SWE-bench Verified, 97.3% on AIME 2025, and 86.4% on LiveCodeBench-V6.
  • Apache 2.0 license with dramatically lower decoding cost than comparable models.
Quick note: This guide is based on public docs and release pages, but you should still verify current pricing, limits, supported tools, and region-specific billing on the official source before you pay, subscribe, or integrate.

Architecture: agent-first MoE with Multi-Token Prediction

StepFun Step 3.5 Flash is designed around a 45-layer Mixture-of-Experts architecture with a hidden dimension of 4096. Each layer contains 288 routed experts plus 1 shared expert, with top-8 activation selecting only 11 billion of the 196.81 billion total parameters per token. This sparse activation is what enables the model to achieve high throughput without massive compute requirements.

A key innovation is Multi-Token Prediction (MTP-3), which predicts 4 tokens per forward pass instead of the standard 1. Combined with Sliding Window Attention at a 3:1 ratio and a vocabulary of 128,896 tokens, the model reaches inference speeds of up to 350 tokens per second. The 256K context window supports long-form agent interactions and complex coding sessions.

Step 3.5 Flash official snapshot infographic
The official Step 3.5 Flash repo, pricing docs, and StepPlan overview make the speed-versus-cost story much easier to explain than generic benchmark slides. Source: Official Step 3.5 Flash GitHub repository.
Official Step 3.5 Flash GitHub page screenshot

Official screenshot

Step 3.5 Flash exposes architecture and benchmark rows directly on the official repo page

For public writing, the GitHub page is one of the best official assets because it makes the MoE architecture, benchmark table, and open-weight route visible without guesswork.

  • Strong image for explaining why Step 3.5 Flash is framed around speed and decoding efficiency.
  • Useful when readers want to verify that the model is an Apache 2.0 open release.

Source: Official Step 3.5 Flash GitHub repository.

StepFun Step 3.5 Flash architecture specifications
ParameterValue
Total parameters196.81B
Active parameters11B
Architecture typeMixture-of-Experts (MoE)
Routed experts per layer288
Shared experts per layer1
Experts activated per token8 (top-8) + 1 shared
Layers45
Hidden dimension4096
Context window256K tokens
Vocabulary size128,896
Multi-Token PredictionMTP-3 (4 tokens per forward pass)
Attention mechanismSliding Window Attention (3:1 ratio)
Max inference speedUp to 350 TPS
LicenseApache 2.0

Benchmarks: competitive at a fraction of the cost

Step 3.5 Flash delivers benchmark results that compete with much larger models. It scores 74.4% on SWE-bench Verified, 51.0% on Terminal-Bench 2.0, and 97.3% on AIME 2025 — the latter placing it among the top publicly reported scores on that benchmark. On LiveCodeBench-V6 it reaches 86.4%, and on GAIA (no file) it scores 84.5.

The BrowseComp score is 51.6 without context and 69.0 with context, showing that the model benefits significantly from augmented retrieval. The tau2-Bench score of 88.2 indicates strong tool-use and agent capabilities.

Step 3.5 Flash benchmarks vs competitors
BenchmarkStep 3.5 FlashDeepSeek V3.2Kimi K2.5GLM-4.7MiniMax M2.1MiMo-V2 Flash
SWE-bench Verified74.4%76.8%
Terminal-Bench 2.051.0%50.8%
tau2-Bench88.2
BrowseComp51.6 (69.0 w/ ctx)
AIME 202597.3%96.1%
LiveCodeBench-V686.4%85.0%
GAIA (no file)84.5

Decoding cost advantage

One of the most compelling aspects of Step 3.5 Flash is its decoding efficiency. Because only 11 billion of 196.81 billion parameters are active per token, and Multi-Token Prediction generates 4 tokens per forward pass, the model achieves a dramatically lower decoding cost compared to models with similar capability.

Compared to DeepSeek V3.2, Step 3.5 Flash is approximately 6 times cheaper to decode. Compared to Kimi K2, it is approximately 18.9 times cheaper. This cost advantage makes it particularly attractive for high-volume production deployments where inference cost is a primary concern.

Relative decoding cost comparison (lower is better)
ModelRelative decoding cost
Step 3.5 Flash1.0x (baseline)
DeepSeek V3.26.0x
Kimi K218.9x

Access routes: PAYG API pricing and Step Plan subscriptions

StepFun now publishes two distinct official buying routes for Step 3.5 Flash. The direct API route appears on the public pricing page, where `step-3.5-flash` is listed at 0.7 CNY per 1M uncached input tokens, 0.14 CNY cached input tokens, and 2.1 CNY output tokens. That is the cleanest public token-billing row for programmatic access.

The Step Plan route is different. The official Step Plan overview sells Step 3.5 Flash through subscription bundles for tool-first usage, with Flash Mini at ¥49/month, Flash Plus at ¥99/month, Flash Pro at ¥199/month, and Flash Max at ¥699/month. The same page also documents a dedicated Step Plan base URL and positions the plan around OpenClaw, Claude Code, Trae, Cursor, and related tools rather than raw PAYG token billing.

  • Step Plan uses a dedicated base URL: `https://api.stepfun.com/step_plan/v1`.
  • The Step Plan overview currently lists `step-3.5-flash-2603`, `step-3.5-flash`, and `stepaudio-2.5-tts` in the supported-model lineup.
  • The StepFun pricing page also lists StepAudio 2.5 TTS at ¥5.8 per 10,000 characters, which helps show the broader product surface without mixing TTS pricing into the Step 3.5 Flash token row.
  • For buyer-facing writing, keep the API PAYG route and Step Plan subscription route separate. They are not the same billing surface.
Official StepFun access routes for Step 3.5 Flash
RoutePublic pricing or planBest fitNotes
Direct API¥0.7 input / ¥0.14 cached input / ¥2.1 output per 1M tokensBuilders who want PAYG API billingPublished on the StepFun pricing details page under `step-3.5-flash`.
Step Plan Flash Mini¥49 / monthLight tool-first usageSubscription route with 5-hour and weekly prompt quotas.
Step Plan Flash Plus¥99 / monthFrequent coding and agent workSame Step Plan route, higher quota tier.
Step Plan Flash Pro¥199 / monthHeavy individual useSame Step Plan route, larger quota tier.
Step Plan Flash Max¥699 / monthTeam or production-style usageHighest public Step Plan tier.
BuyGLM shows package prices in USD. When a source page is published in CNY, the displayed value uses a fixed 1 USD = 8 CNY conversion and should still be checked against the live vendor page before payment.

Use Step 3.5 Flash when inference cost and speed matter as much as accuracy

The Apache 2.0 weights are the fastest path for experimentation. The API is the fastest path for production deployments that need high-throughput agentic workflows at low cost.

Sources and official links

Frequently asked questions

What makes Step 3.5 Flash different from other MoE models?

Step 3.5 Flash has an unusually low active-parameter ratio (11B out of 196.81B) and uses Multi-Token Prediction (MTP-3) to generate 4 tokens per forward pass. This combination yields up to 350 TPS and dramatically lower decoding costs compared to peers.

Is Step 3.5 Flash open source?

Yes. Step 3.5 Flash is released under the Apache 2.0 license. Weights are available on GitHub and Hugging Face.

How does Step 3.5 Flash compare on coding benchmarks?

It scores 74.4% on SWE-bench Verified and 51.0% on Terminal-Bench 2.0. While slightly below the top scorers on SWE-bench (e.g., K2.5 at 76.8%), it delivers competitive results at a fraction of the inference cost.