model release

StepFun releases Step-3.7-Flash: 198B-parameter MoE model with 256K context at $0.20/M input tokens

TL;DR

StepFun has released Step-3.7-Flash, a 198B-parameter sparse Mixture-of-Experts vision-language model that activates 11B parameters per token and delivers up to 400 tokens per second. The model supports a 256K context window, three selectable reasoning levels, and is priced at $0.20 per million input tokens (cache miss) and $1.15 per million output tokens.

2 min read
0

Step-3.7-Flash — Quick Specs

Context window256K tokens
Input$0.2/1M tokens
Output$1.15/1M tokens

StepFun Releases Step-3.7-Flash: 198B-Parameter MoE Model

StepFun has released Step-3.7-Flash, a 198B-parameter sparse Mixture-of-Experts (MoE) vision-language model that activates approximately 11B parameters per token. The model combines a 196B-parameter language backbone with a 1.8B-parameter vision encoder and delivers throughput of up to 400 tokens per second.

Core Specifications

Step-3.7-Flash supports a 256K context window and offers three selectable reasoning levels (low, medium, and high) for developers to balance speed, cost, and cognitive depth. The model is designed for production workloads requiring perception, search, and reasoning across multi-step workflows.

Benchmark Performance

According to StepFun, Step-3.7-Flash achieves:

  • SimpleVQA (Search): 79.2 (first place)
  • *V (Python)**: 95.3
  • ClawEval-1.1: 67.1 (significantly ahead of second place at 59.8)
  • SWE-Bench PRO: 56.3 (second place)
  • Toolathlon: 49.5
  • HLE w. Tool: 48.1
  • Terminal-Bench 2.1: 59.5
  • GDPVal-AA: 45.8

The ClawEval-1.1 score demonstrates high resistance to adversarial traps during multi-turn orchestration, while the SWE-Bench PRO result shows capability in tracing multi-file repositories and generating functional patches.

Pricing

Step-3.7-Flash pricing structure:

  • Input (cache miss): $0.20 per million tokens
  • Input (cache hit): $0.04 per million tokens
  • Output: $1.15 per million tokens

Availability and Deployment

The model is available through StepFun's Open Platform (platform.stepfun.ai for global access and platform.stepfun.com for China), OpenRouter, and NVIDIA NIM. StepFun is partnering with DeepInfra, Fireworks AI, and Modal for expanded availability.

For local deployment, Step-3.7-Flash supports vLLM, SGLang, Hugging Face Transformers, and llama.cpp. The model can run on NVIDIA DGX Station, AMD Ryzen AI Max+ 395-based systems, and Mac Studio/MacBook Pro devices with at least 128GB unified memory.

StepFun model support has been integrated into the NVIDIA Nemo ecosystem, including AutoModel, Megatron Core, and Megatron Bridge.

Technical Architecture

The sparse MoE architecture activates only 11B of the total 198B parameters per token, enabling the claimed 400 tokens per second throughput. The 1.8B-parameter vision encoder provides native image understanding for processing UI wireframes, application GUIs, and data charts.

What This Means

Step-3.7-Flash targets the growing segment of agentic workflows that require multi-modal understanding and tool orchestration at production scale. The sparse MoE architecture delivers a favorable compute-to-capability ratio compared to dense models of similar performance, though real-world latency and throughput will depend heavily on deployment infrastructure. The $0.20/M input token pricing positions it competitively against frontier models from larger providers, particularly for workloads that benefit from the 256K context window and cache hit pricing of $0.04/M tokens.

Related Articles

model release

StepFun launches Step 3.7 Flash: 196B MoE model with 256K context and adjustable reasoning levels at $0.20/$1.15 per 1M

StepFun has released Step 3.7 Flash, a 196B-parameter Mixture-of-Experts model that activates approximately 11B parameters per token. The multimodal model supports a 256K context window and introduces selectable reasoning levels (high/medium/low), priced at $0.20 per 1M input tokens and $1.15 per 1M output tokens.

model release

Anthropic releases Claude Opus 4.8 with improved agentic coding and reasoning benchmarks

Anthropic released Claude Opus 4.8 on May 28, 2026, with improved performance in agentic coding, computer use, and reasoning benchmarks. Pricing remains at $5 per million input tokens and $25 per million output tokens, while the model's fast mode is now three times cheaper than previous versions.

model release

Anthropic releases Claude Opus 4.8 with 69.2% agentic coding score, 2.5x faster performance

Anthropic released Claude Opus 4.8 on May 28, 2026, six weeks after version 4.7. The model achieves 69.2% on agentic coding benchmarks (up from 64.3%), runs 2.5 times faster in fast mode at one-third the cost, while maintaining the same pricing as version 4.7.

model release

Mistral AI Releases Small 4: 119B Parameter Open-Source Model with 256K Context Under Apache 2.0

Mistral AI has released Mistral Small 4, a 119B total parameter mixture-of-experts model with 256K context window and native multimodal capabilities. The model uses 128 experts with 4 active per token (6B active parameters) and is released under the Apache 2.0 license, marking Mistral's first unified model combining reasoning, multimodal, and coding capabilities.

Comments

Loading...