Zyphra Releases ZAYA1-8B: 8.4B-Parameter MoE Model With 760M Active Parameters That Matches 80B+ Models on Math Benchmarks
Zyphra has released ZAYA1-8B, a mixture-of-experts language model with 760M active parameters and 8.4B total parameters. The model scores 89.1% on AIME 2026, competitive with models exceeding 100B parameters, while maintaining efficiency for on-device deployment.
Benchmark Performance
ZAYA1-8B scores 89.1% on AIME 2026, outperforming Qwen3-4B-Thinking-2507 (77.5%) and Gemma-4-E4B-it (50.3%). According to Zyphra, the model matches or exceeds the performance of significantly larger reasoning models:
- AIME 2026: 89.1% (vs. 90.2% for Qwen3-Next-80B-A3B-Think with 80B total parameters)
- HMMT February 2026: 71.6% (vs. 79.3% for Qwen3-Next-80B)
- LiveCodeBench v6: 63.8% (comparable to larger models)
- GPQA-Diamond: 71.0%
- MMLU-Pro: 74.2%
- IFEval: 85.8%
The model also scores 59.3% on IMO-AnswerBench and 32.2% on APEX-shortlist, well ahead of models in its size class.
Architecture and Efficiency
ZAYA1-8B uses a mixture-of-experts architecture that activates only 760M of its 8.4B total parameters during inference. The small active footprint makes on-device deployment feasible while remaining competitive with much larger MoE models such as Mistral-Small-4-119B (6B active, 119B total) and Intellect-3 (12B active, 106B total).
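As a rough illustration of why an MoE model's active parameter count is so much smaller than its total, the sketch below implements generic top-k expert routing: each token is sent through only a few experts, so only that slice of the weights participates in the forward pass. The expert count, dimensions, and routing scheme here are placeholders and do not reflect ZAYA1's actual configuration.

```python
# Minimal, illustrative top-k mixture-of-experts layer (not ZAYA1's architecture).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, n_experts=8, top_k=2, d_ff=2048):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # scores each expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                              # x: (tokens, d_model)
        weights = F.softmax(self.router(x), dim=-1)    # (tokens, n_experts)
        top_w, top_idx = weights.topk(self.top_k, dim=-1)
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        # Only the top_k selected experts run for each token; the rest stay idle,
        # which is why active parameters are a fraction of total parameters.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e
                if mask.any():
                    out[mask] += top_w[mask, slot:slot + 1] * expert(x[mask])
        return out
```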
The model must be installed from Zyphra's forked versions of the vLLM and Transformers libraries. Serving it requires the --mamba-cache-dtype float32 --dtype bfloat16 flags and a custom reasoning parser.
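As a rough sketch of what querying a locally served instance might look like, assuming the OpenAI-compatible endpoint that vLLM exposes by default: the model identifier, port, and prompt below are placeholders, and the serve command in the comment reflects only the flags Zyphra documents, not the full setup from their forks.

```python
# Illustrative client call against a locally served ZAYA1-8B endpoint.
# Assumes the server was started with Zyphra's vLLM fork, e.g.:
#   vllm serve <model-path> --mamba-cache-dtype float32 --dtype bfloat16
# Model name and base_url are placeholders for whatever the server registers.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Zyphra/ZAYA1-8B",  # placeholder identifier
    messages=[{"role": "user", "content": "What is the sum of the first 50 odd numbers?"}],
    temperature=0.6,
)
print(response.choices[0].message.content)
```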
Technical Specifications
- Active parameters: 760M
- Total parameters: 8.4B
- Model type: Mixture of experts with reasoning capabilities
- Inference format: Requires vLLM server with custom flags
- Recommended dtype: bfloat16 with float32 mamba cache
Availability
The post-trained reasoning version is available on Hugging Face. Zyphra has also released the pretraining base model separately. Pricing information has not been disclosed.
What This Means
ZAYA1-8B demonstrates that mixture-of-experts architectures can achieve frontier-level mathematical reasoning with a fraction of the active parameters typically required. The 760M active parameter count makes it viable for edge deployment scenarios where models like Qwen3-Next-80B (3B active, 80B total) would be impractical. However, the model's relative weakness on creative writing (62.97% on Creative Writing v3 vs. 83.75% for Gemma-4-E4B) and on agentic benchmarks (39.22% on BFCL-v4) suggests the efficiency gains come with tradeoffs in general capability. The requirement for custom library forks may also limit immediate adoption.
Related Articles
Google DeepMind releases Gemma 4 with 31B dense model, 256K context window, and speculative decoding drafters
Google DeepMind has released Gemma 4, a family of open-weight multimodal models including a 31B dense model with 256K context window and four size variants ranging from 2.3B to 30.7B effective parameters. The release includes Multi-Token Prediction (MTP) draft models that achieve up to 2x decoding speedup through speculative decoding while maintaining identical output quality.
NVIDIA releases Nemotron-3-Nano-Omni-30B, a 31B-parameter multimodal model with 256K context and reasoning mode
NVIDIA released Nemotron-3-Nano-Omni-30B-A3B, a multimodal large language model with 31 billion parameters that processes video, audio, images, and text with up to 256K token context. The model uses a Mamba2-Transformer hybrid Mixture of Experts architecture and supports chain-of-thought reasoning mode.
Mistral Releases Medium 3.5: 128B Dense Model With 256k Context and Configurable Reasoning
Mistral AI released Mistral Medium 3.5, a 128B parameter dense model with a 256k context window that unifies instruction-following, reasoning, and coding capabilities. The model features configurable reasoning effort per request and a vision encoder trained from scratch for variable image sizes.
NVIDIA Releases Nemotron 3 Nano Omni: 31B Multimodal Model With 256K Context and Reasoning Mode
NVIDIA released Nemotron 3 Nano Omni, a 31B parameter (30B active, 3B per token) multimodal model supporting video, audio, image, and text inputs. The model features a 256K token context window, reasoning mode with chain-of-thought, and tool calling capabilities.