model releaseStepFun

StepFun Releases Step-3.7-Flash: 198B-Parameter Sparse MoE Model With 256K Context in GGUF Format

TL;DR

StepFun has released Step-3.7-Flash, a 198B-parameter sparse Mixture-of-Experts vision-language model that activates approximately 11B parameters per token. The model supports a 256K context window, native image understanding via a 1.8B-parameter vision encoder, and offers three selectable reasoning levels.

June 1, 2026 · 8:06 AM2 min read

Step-3.7-Flash — Quick Specs

Context window256K tokens

Compare Step-3.7-Flash with other models →

StepFun Releases Step-3.7-Flash: 198B-Parameter Sparse MoE Model With 256K Context in GGUF Format

StepFun has released Step-3.7-Flash in GGUF quantization format, a 198B-parameter sparse Mixture-of-Experts vision-language model that activates approximately 11B parameters per token. The model is now available for local deployment on consumer hardware with 128GB of unified memory.

Model Architecture and Capabilities

Step-3.7-Flash combines a 196B-parameter language backbone with a 1.8B-parameter vision encoder for native image understanding. According to StepFun, the sparse architecture enables throughput up to 400 tokens per second while maintaining a 256K token context window.

The model offers three selectable reasoning levels—low, medium, and high—designed to balance speed, cost, and reasoning depth. StepFun positions the model for agentic workloads including tool calling, multi-step reasoning, code generation, and mathematics, with native multilingual support.

Quantization Options and Memory Requirements

StepFun released seven quantization variants ranging from BF16 (394GB) to IQ3_XXS (76GB). The recommended Q4_K_S quantization requires 112GB, enabling full 256K context inference on devices with 128GB unified memory, including Apple Mac Studio (M4 Max), NVIDIA DGX Spark (GB10), and AMD Ryzen AI Max+ 395.

All quantizations below Q8_0 use imatrix calibration. A separate 4GB vision projector (mmproj) file enables multimodal inference when paired with any language quantization.

Benchmark Performance

On Apple Mac Studio (M4 Max, 128GB), the Q4_K_S quantization achieved:

420 tokens/second prompt processing at 2K context
48.5 tokens/second generation at 2K context
110 tokens/second prompt processing at 262K context
9.7 tokens/second generation at 262K context

On NVIDIA DGX Spark (GB10, 128GB), the same quantization reached 753 tokens/second prompt processing at 8K context and 26 tokens/second generation at 2K context.

The IQ4_XS quantization (105GB) delivered comparable performance with 7GB less memory usage.

Implementation Details

The model requires a custom branch of llama.cpp maintained by StepFun. Users must build from the step3.7 branch to access compatibility with the model's architecture. StepFun provides command-line inference tools and an OpenAI-compatible server for both text-only and multimodal workloads.

Pricing information has not been disclosed. The model files are available on Hugging Face.

What This Means

Step-3.7-Flash represents a significant accessibility milestone for large-scale vision-language models. By activating only 11B of 198B parameters per token, StepFun has made a model with GPT-4-class parameter count deployable on high-end consumer hardware. The 256K context window—previously limited to cloud-based models—can now run locally at speeds viable for production use cases. However, generation speeds drop substantially at maximum context (under 10 t/s at 262K), suggesting practical use will favor shorter contexts. The model's viability for production depends on benchmark scores against established models, which StepFun has not yet published.

Source: huggingface.co ↗

StepFun Step-3.7-Flash MoE vision-language GGUF local deployment quantization multimodal

model releaseJuly 14, 2026

PrismML releases Bonsai 27B, claims first 27B-parameter model to run on-device on iPhone at 4GB memory footprint

PrismML has released Bonsai 27B, claiming it's the first 27-billion parameter model capable of running on-device on iPhone. The model achieves 58-87 tokens per second on Apple's M5 Max chip with a 4GB memory footprint, using 1-bit and ternary quantization to fit within iPhone's approximately 6GB available app memory.

model releaseJuly 14, 2026

Google releases Gemma 4 E2B, optimized to run natively on Pixel 10's Tensor G5 TPU

Google has released Gemma 4 E2B for TPU, a variant of its open-source Gemma 4 model optimized to run natively on the Tensor G5 chip in Pixel 10 devices. The multimodal model enables completely offline AI chat, image recognition, and audio transcription on Pixel 10, 10 Pro, 10 Pro XL, and 10 Pro Fold.

model releaseJuly 16, 2026

Nvidia Launches Cosmos 3 Edge World Model for Physical AI, Forms Japan Industrial Coalition

Nvidia released Cosmos 3 Edge, a world model designed for robots and vision AI agents to perceive and navigate physical environments in real time. The company announced partnerships with Japanese industrial giants including Fujitsu, Hitachi, and Kawasaki Heavy Industries as part of its physical AI expansion.

model releaseJuly 15, 2026

Mira Murati's Thinking Machines releases Inkling, 975B-parameter open-weight model trained on 45T tokens

Thinking Machines Lab released Inkling, a 975-billion-parameter mixture-of-experts model that uses 41 billion active parameters per task. The open-weight model was trained on 45 trillion tokens across text, image, audio, and video, marking the first public release from Mira Murati's AI startup.

StepFun Releases Step-3.7-Flash: 198B-Parameter Sparse MoE Model With 256K Context in GGUF Format

Step-3.7-Flash — Quick Specs

StepFun Releases Step-3.7-Flash: 198B-Parameter Sparse MoE Model With 256K Context in GGUF Format

Model Architecture and Capabilities

Quantization Options and Memory Requirements

Benchmark Performance

Implementation Details

What This Means

Related Articles

PrismML releases Bonsai 27B, claims first 27B-parameter model to run on-device on iPhone at 4GB memory footprint

Google releases Gemma 4 E2B, optimized to run natively on Pixel 10's Tensor G5 TPU

Nvidia Launches Cosmos 3 Edge World Model for Physical AI, Forms Japan Industrial Coalition

Mira Murati's Thinking Machines releases Inkling, 975B-parameter open-weight model trained on 45T tokens

Comments