StepFun Releases Step-3.7-Flash: 198B-Parameter Sparse MoE Model With 256K Context in GGUF Format
StepFun has released Step-3.7-Flash, a 198B-parameter sparse Mixture-of-Experts vision-language model that activates approximately 11B parameters per token. The model supports a 256K context window, native image understanding via a 1.8B-parameter vision encoder, and offers three selectable reasoning levels.
StepFun Releases Step-3.7-Flash: 198B-Parameter Sparse MoE Model With 256K Context in GGUF Format
StepFun has released Step-3.7-Flash in GGUF quantization format, a 198B-parameter sparse Mixture-of-Experts vision-language model that activates approximately 11B parameters per token. The model is now available for local deployment on consumer hardware with 128GB of unified memory.
Model Architecture and Capabilities
Step-3.7-Flash combines a 196B-parameter language backbone with a 1.8B-parameter vision encoder for native image understanding. According to StepFun, the sparse architecture enables throughput up to 400 tokens per second while maintaining a 256K token context window.
The model offers three selectable reasoning levels—low, medium, and high—designed to balance speed, cost, and reasoning depth. StepFun positions the model for agentic workloads including tool calling, multi-step reasoning, code generation, and mathematics, with native multilingual support.
Quantization Options and Memory Requirements
StepFun released seven quantization variants ranging from BF16 (394GB) to IQ3_XXS (76GB). The recommended Q4_K_S quantization requires 112GB, enabling full 256K context inference on devices with 128GB unified memory, including Apple Mac Studio (M4 Max), NVIDIA DGX Spark (GB10), and AMD Ryzen AI Max+ 395.
All quantizations below Q8_0 use imatrix calibration. A separate 4GB vision projector (mmproj) file enables multimodal inference when paired with any language quantization.
Benchmark Performance
On Apple Mac Studio (M4 Max, 128GB), the Q4_K_S quantization achieved:
- 420 tokens/second prompt processing at 2K context
- 48.5 tokens/second generation at 2K context
- 110 tokens/second prompt processing at 262K context
- 9.7 tokens/second generation at 262K context
On NVIDIA DGX Spark (GB10, 128GB), the same quantization reached 753 tokens/second prompt processing at 8K context and 26 tokens/second generation at 2K context.
The IQ4_XS quantization (105GB) delivered comparable performance with 7GB less memory usage.
Implementation Details
The model requires a custom branch of llama.cpp maintained by StepFun. Users must build from the step3.7 branch to access compatibility with the model's architecture. StepFun provides command-line inference tools and an OpenAI-compatible server for both text-only and multimodal workloads.
Pricing information has not been disclosed. The model files are available on Hugging Face.
What This Means
Step-3.7-Flash represents a significant accessibility milestone for large-scale vision-language models. By activating only 11B of 198B parameters per token, StepFun has made a model with GPT-4-class parameter count deployable on high-end consumer hardware. The 256K context window—previously limited to cloud-based models—can now run locally at speeds viable for production use cases. However, generation speeds drop substantially at maximum context (under 10 t/s at 262K), suggesting practical use will favor shorter contexts. The model's viability for production depends on benchmark scores against established models, which StepFun has not yet published.
Related Articles
StepFun launches Step 3.7 Flash: 196B MoE model with 256K context and adjustable reasoning levels at $0.20/$1.15 per 1M
StepFun has released Step 3.7 Flash, a 196B-parameter Mixture-of-Experts model that activates approximately 11B parameters per token. The multimodal model supports a 256K context window and introduces selectable reasoning levels (high/medium/low), priced at $0.20 per 1M input tokens and $1.15 per 1M output tokens.
Liquid AI Releases LFM2.5-8B: 8-Billion Parameter Hybrid Model Optimized for Edge Deployment
Liquid AI has released LFM2.5-8B-A1B, an 8-billion parameter hybrid model designed specifically for edge AI and on-device deployment. The model is available in multiple GGUF quantized formats ranging from 4-bit (4.84 GB) to 16-bit (16.9 GB), optimized for memory efficiency.
NVIDIA Releases Cosmos 3: 8B and 32B Omni-Models Combining Video Generation, Reasoning, and Action in Single Architectur
NVIDIA has released Cosmos 3, a unified omni-model that combines world generation, physical reasoning, and action generation in a single architecture. Available in 8B (Nano) and 32B (Super) parameter versions on Hugging Face, Cosmos 3 uses a Mixture-of-Transformers architecture to process text, image, video, audio, and action modalities without switching between separate models.
MiniMax Launches M3 Model With 1M Context Window at $0.30 Per Million Input Tokens
MiniMax has released M3, a multimodal foundation model supporting text, image, and video inputs with a 1-million-token context window. The model costs $0.30 per million input tokens and $1.20 per million output tokens, available through OpenRouter.
Comments
Loading...