Zyphra Releases ZAYA1-8B: 8.4B-Parameter MoE Model With 760M Active Parameters That Matches 80B+ Models on Math Benchmarks
Zyphra has released ZAYA1-8B, a mixture-of-experts language model with 760M active parameters and 8.4B total parameters. The model scores 89.1% on AIME 2026, competitive with models exceeding 100B parameters, while maintaining efficiency for on-device deployment.
Benchmark Performance
ZAYA1-8B scores 89.1% on AIME 2026, outperforming Qwen3-4B-Thinking-2507 (77.5%) and Gemma-4-E4B-it (50.3%). According to Zyphra, the model matches or exceeds the performance of significantly larger reasoning models:
- AIME 2026: 89.1% (vs. 90.2% for Qwen3-Next-80B-A3B-Think with 80B total parameters)
- HMMT February 2026: 71.6% (vs. 79.3% for Qwen3-Next-80B)
- LiveCodeBench v6: 63.8% (comparable to larger models)
- GPQA-Diamond: 71.0%
- MMLU-Pro: 74.2%
- IFEval: 85.8%
The model also scores 59.3% on IMO-AnswerBench and 32.2% on APEX-shortlist, well ahead of models in its size class.
Architecture and Efficiency
ZAYA1-8B uses a mixture-of-experts architecture that activates only 760M of its 8.4B total parameters during inference. The small active footprint makes on-device deployment feasible while remaining competitive with much larger MoE models such as Mistral-Small-4-119B (6B active, 119B total) and Intellect-3 (12B active, 106B total).
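As a rough illustration of why an MoE model's active parameter count is so much smaller than its total, the sketch below implements generic top-k expert routing: each token is sent through only a few experts, so only that slice of the weights participates in the forward pass. The expert count, dimensions, and routing scheme here are placeholders and do not reflect ZAYA1's actual configuration.

```python
# Minimal, illustrative top-k mixture-of-experts layer (not ZAYA1's architecture).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, n_experts=8, top_k=2, d_ff=2048):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # scores each expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                              # x: (tokens, d_model)
        weights = F.softmax(self.router(x), dim=-1)    # (tokens, n_experts)
        top_w, top_idx = weights.topk(self.top_k, dim=-1)
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        # Only the top_k selected experts run for each token; the rest stay idle,
        # which is why active parameters are a fraction of total parameters.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e
                if mask.any():
                    out[mask] += top_w[mask, slot:slot + 1] * expert(x[mask])
        return out
```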
The model must be installed from Zyphra's forked versions of the vLLM and Transformers libraries. Serving it requires the --mamba-cache-dtype float32 --dtype bfloat16 flags and a custom reasoning parser.
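As a rough sketch of what querying a locally served instance might look like, assuming the OpenAI-compatible endpoint that vLLM exposes by default: the model identifier, port, and prompt below are placeholders, and the serve command in the comment reflects only the flags Zyphra documents, not the full setup from their forks.

```python
# Illustrative client call against a locally served ZAYA1-8B endpoint.
# Assumes the server was started with Zyphra's vLLM fork, e.g.:
#   vllm serve <model-path> --mamba-cache-dtype float32 --dtype bfloat16
# Model name and base_url are placeholders for whatever the server registers.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Zyphra/ZAYA1-8B",  # placeholder identifier
    messages=[{"role": "user", "content": "What is the sum of the first 50 odd numbers?"}],
    temperature=0.6,
)
print(response.choices[0].message.content)
```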
Technical Specifications
- Active parameters: 760M
- Total parameters: 8.4B
- Model type: Mixture of experts with reasoning capabilities
- Inference format: Requires vLLM server with custom flags
- Recommended dtype: bfloat16 with float32 mamba cache
Availability
The post-trained reasoning version is available on Hugging Face. Zyphra has also released the pretraining base model separately. Pricing information has not been disclosed.
What This Means
ZAYA1-8B demonstrates that mixture-of-experts architectures can achieve frontier-level mathematical reasoning with a fraction of the active parameters typically required. The 760M active parameter count makes it viable for edge deployment scenarios where models like Qwen3-Next-80B (3B active, 80B total) would be impractical. However, the model's relative weakness on creative writing (62.97% on Creative Writing v3 vs. 83.75% for Gemma-4-E4B) and on agentic benchmarks (39.22% on BFCL-v4) suggests the efficiency gains come with tradeoffs in general capability. The requirement for custom library forks may also limit immediate adoption.
Related Articles
Google DeepMind releases Gemma 4 with 31B dense model, 256K context window, and speculative decoding drafters
Google DeepMind has released Gemma 4, a family of open-weight multimodal models including a 31B dense model with 256K context window and four size variants ranging from 2.3B to 30.7B effective parameters. The release includes Multi-Token Prediction (MTP) draft models that achieve up to 2x decoding speedup through speculative decoding while maintaining identical output quality.
NVIDIA releases Nemotron-3-Nano-Omni-30B, a 31B-parameter multimodal model with 256K context and reasoning mode
NVIDIA released Nemotron-3-Nano-Omni-30B-A3B, a multimodal large language model with 31 billion parameters that processes video, audio, images, and text with up to 256K token context. The model uses a Mamba2-Transformer hybrid Mixture of Experts architecture and supports chain-of-thought reasoning mode.
Mistral Releases Medium 3.5: 128B Dense Model With 256k Context and Configurable Reasoning
Mistral AI released Mistral Medium 3.5, a 128B parameter dense model with a 256k context window that unifies instruction-following, reasoning, and coding capabilities. The model features configurable reasoning effort per request and a vision encoder trained from scratch for variable image sizes.
NVIDIA Releases Nemotron 3 Nano Omni: 31B Multimodal Model With 256K Context and Reasoning Mode
NVIDIA released Nemotron 3 Nano Omni, a 31B parameter (30B active, 3B per token) multimodal model supporting video, audio, image, and text inputs. The model features a 256K token context window, reasoning mode with chain-of-thought, and tool calling capabilities.