model release

StepFun releases Step-3.7-Flash: 198B-parameter MoE model with 256K context at $0.20/M input tokens

TL;DR

StepFun has released Step-3.7-Flash, a 198B-parameter sparse Mixture-of-Experts vision-language model that activates 11B parameters per token and delivers up to 400 tokens per second. The model supports a 256K context window, three selectable reasoning levels, and is priced at $0.20 per million input tokens (cache miss) and $1.15 per million output tokens.

May 29, 2026 · 12:51 PM2 min read

Step-3.7-Flash — Quick Specs

Context window256K tokens

Compare Step-3.7-Flash with other models →

StepFun Releases Step-3.7-Flash: 198B-Parameter MoE Model

StepFun has released Step-3.7-Flash, a 198B-parameter sparse Mixture-of-Experts (MoE) vision-language model that activates approximately 11B parameters per token. The model combines a 196B-parameter language backbone with a 1.8B-parameter vision encoder and delivers throughput of up to 400 tokens per second.

Core Specifications

Step-3.7-Flash supports a 256K context window and offers three selectable reasoning levels (low, medium, and high) for developers to balance speed, cost, and cognitive depth. The model is designed for production workloads requiring perception, search, and reasoning across multi-step workflows.

Benchmark Performance

According to StepFun, Step-3.7-Flash achieves:

SimpleVQA (Search): 79.2 (first place)
*V (Python)**: 95.3
ClawEval-1.1: 67.1 (significantly ahead of second place at 59.8)
SWE-Bench PRO: 56.3 (second place)
Toolathlon: 49.5
HLE w. Tool: 48.1
Terminal-Bench 2.1: 59.5
GDPVal-AA: 45.8

The ClawEval-1.1 score demonstrates high resistance to adversarial traps during multi-turn orchestration, while the SWE-Bench PRO result shows capability in tracing multi-file repositories and generating functional patches.

Pricing

Step-3.7-Flash pricing structure:

Input (cache miss): $0.20 per million tokens
Input (cache hit): $0.04 per million tokens
Output: $1.15 per million tokens

Availability and Deployment

The model is available through StepFun's Open Platform (platform.stepfun.ai for global access and platform.stepfun.com for China), OpenRouter, and NVIDIA NIM. StepFun is partnering with DeepInfra, Fireworks AI, and Modal for expanded availability.

For local deployment, Step-3.7-Flash supports vLLM, SGLang, Hugging Face Transformers, and llama.cpp. The model can run on NVIDIA DGX Station, AMD Ryzen AI Max+ 395-based systems, and Mac Studio/MacBook Pro devices with at least 128GB unified memory.

StepFun model support has been integrated into the NVIDIA Nemo ecosystem, including AutoModel, Megatron Core, and Megatron Bridge.

Technical Architecture

The sparse MoE architecture activates only 11B of the total 198B parameters per token, enabling the claimed 400 tokens per second throughput. The 1.8B-parameter vision encoder provides native image understanding for processing UI wireframes, application GUIs, and data charts.

What This Means

Step-3.7-Flash targets the growing segment of agentic workflows that require multi-modal understanding and tool orchestration at production scale. The sparse MoE architecture delivers a favorable compute-to-capability ratio compared to dense models of similar performance, though real-world latency and throughput will depend heavily on deployment infrastructure. The $0.20/M input token pricing positions it competitively against frontier models from larger providers, particularly for workloads that benefit from the 256K context window and cache hit pricing of $0.04/M tokens.

Source: huggingface.co ↗

stepfun mixture-of-experts vision-language-model benchmark pricing deployment

model releaseJuly 9, 2026

OpenAI releases GPT-5.6 family in three sizes: Luna at $1/$6, Terra at $2.50/$15, Sol at $5/$30 per 1M tokens

OpenAI released its GPT-5.6 flagship model family in three sizes: Luna ($1/$6 per 1M tokens), Terra ($2.50/$15), and Sol ($5/$30). The company claims GPT-5.6 Sol scores 53.6 on the Agents' Last Exam benchmark, outperforming Claude Fable 5's score by 13.1 points.

model releaseJuly 9, 2026

Meta launches Muse Spark 1.1 coding model at $1.25/$4.25 per million tokens

Meta publicly released Muse Spark 1.1, a multimodal AI model designed for agentic coding workflows. The model is priced at $1.25 per million input tokens and $4.25 per million output tokens, positioning it slightly above Anthropic's Claude Haiku 4.5 and OpenAI's GPT-5.6 Luna.

model releaseJuly 9, 2026

OpenAI Releases GPT-5.6 Luna Pro with Extended Reasoning Mode at $1/$6 Per Million Tokens

OpenAI has released GPT-5.6 Luna Pro, a reasoning-enhanced variant of GPT-5.6 Luna with a 1 million token context window. The model is priced at $1 per million input tokens and $6 per million output tokens, with a knowledge cutoff date of February 2026.

model releaseJuly 9, 2026

OpenAI Releases GPT-5.6 Sol Pro with Extended Reasoning Mode at $5 Input/$30 Output per 1M Tokens

OpenAI has released GPT-5.6 Sol Pro, a reasoning-enhanced variant of GPT-5.6 Sol designed for complex tasks. The model features a 1 million token context window, February 2026 knowledge cutoff, and is priced at $5 per 1M input tokens and $30 per 1M output tokens.

StepFun releases Step-3.7-Flash: 198B-parameter MoE model with 256K context at $0.20/M input tokens

Step-3.7-Flash — Quick Specs

StepFun Releases Step-3.7-Flash: 198B-Parameter MoE Model

Core Specifications

Benchmark Performance

Pricing

Availability and Deployment

Technical Architecture

What This Means

Related Articles

OpenAI releases GPT-5.6 family in three sizes: Luna at $1/$6, Terra at $2.50/$15, Sol at $5/$30 per 1M tokens

Meta launches Muse Spark 1.1 coding model at $1.25/$4.25 per million tokens

OpenAI Releases GPT-5.6 Luna Pro with Extended Reasoning Mode at $1/$6 Per Million Tokens

OpenAI Releases GPT-5.6 Sol Pro with Extended Reasoning Mode at $5 Input/$30 Output per 1M Tokens

Comments