model release

Allen Institute releases EMO, 14B parameter MoE model with selective 12.5% expert use

TL;DR

Allen Institute for AI released EMO, a 1B-active, 14B-total-parameter mixture-of-experts model trained on 1 trillion tokens. The model uses 8 active experts per token from a pool of 128 total experts, and can maintain near full-model performance while using just 12.5% of its experts for specific tasks.

2 min read

Allen Institute for AI (AI2) released EMO, a mixture-of-experts model that can maintain near full-model performance while using just 12.5% of its total experts for specific tasks. The model contains 14 billion total parameters with 1 billion active per forward pass, trained on 1 trillion tokens.

Model architecture and capabilities

EMO uses 8 active experts per token selected from a pool of 128 total experts. According to AI2, when limited to a 16-expert subset (12.5% of total experts), the model shows only a 3% absolute performance drop across benchmarks. At 32 experts (25% of total), the degradation is approximately 1%.
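
To picture what restricting the model to an expert subset means in practice, the sketch below masks a router's scores so that only a retained group of experts can ever be selected. This is an illustrative PyTorch fragment, not AI2's released code; the function name and masking rule are assumptions.

```python
import torch

def route_within_subset(router_logits: torch.Tensor,
                        allowed_experts: torch.Tensor,
                        k: int = 8):
    """Pick the top-k experts per token, restricted to an allowed subset.

    router_logits: [num_tokens, num_experts] raw router scores.
    allowed_experts: indices of the experts kept loaded for this deployment.
    """
    num_experts = router_logits.size(-1)
    # Experts outside the subset get -inf so they can never be selected.
    mask = torch.full((num_experts,), float("-inf"))
    mask[allowed_experts] = 0.0
    topk_logits, topk_idx = (router_logits + mask).topk(k, dim=-1)
    weights = torch.softmax(topk_logits, dim=-1)
    return topk_idx, weights

# 128 experts total; route 8 active experts within a 16-expert subset (12.5%).
logits = torch.randn(4, 128)            # 4 tokens
subset = torch.arange(16)               # hypothetical retained experts
indices, weights = route_within_subset(logits, subset, k=8)
```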

The model differs from standard MoE architectures in how experts specialize. Rather than organizing around low-level lexical patterns like prepositions or punctuation, EMO's experts form coherent groups around higher-level domains and capabilities.

Training methodology

EMO uses document boundaries as a supervisory signal during training. All tokens within a single document are constrained to route through the same subset of experts, rather than allowing each token to independently select experts. This document-level routing encourages groups of experts to specialize in consistent domains.
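
A simplified way to express that constraint is to pool the router's scores over a document and let every token in that document draw only from the same top-scoring experts. The pooling rule and names below are illustrative assumptions, not the paper's exact mechanism.

```python
import torch

def document_level_subsets(router_logits: torch.Tensor,
                           doc_ids: torch.Tensor,
                           pool_size: int = 16) -> torch.Tensor:
    """Assign one expert pool per document by pooling router scores.

    router_logits: [num_tokens, num_experts]; doc_ids: [num_tokens].
    Returns a boolean mask [num_tokens, num_experts]; tokens from the same
    document share an identical set of permitted experts.
    """
    num_tokens, num_experts = router_logits.shape
    allowed = torch.zeros(num_tokens, num_experts, dtype=torch.bool)
    for doc in doc_ids.unique():
        rows = (doc_ids == doc).nonzero(as_tuple=True)[0]
        # Average scores over the document, keep the top `pool_size` experts.
        doc_scores = router_logits[rows].mean(dim=0)
        subset = doc_scores.topk(pool_size).indices
        allowed[rows[:, None], subset] = True
    return allowed

# Two documents, 128 experts: each document's tokens share one 16-expert pool.
logits = torch.randn(6, 128)
docs = torch.tensor([0, 0, 0, 1, 1, 1])
mask = document_level_subsets(logits, docs)
```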

The model implements global load balancing across many documents rather than local balancing within micro-batches. This prevents the model from collapsing onto a small number of experts while still allowing document-level expert consistency.
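
One way to approximate global balancing is to accumulate expert usage in a running average across many steps and penalize drift from uniform usage, rather than balancing each micro-batch in isolation. The class below is a toy stand-in for that idea, not AI2's implementation.

```python
import torch

class GlobalLoadBalancer:
    """Toy stand-in for global load balancing across documents.

    Expert usage is accumulated in an exponential moving average over many
    steps, so balance is enforced in the long run rather than inside each
    micro-batch, which leaves room for per-document expert consistency.
    """

    def __init__(self, num_experts: int, momentum: float = 0.99):
        self.usage = torch.full((num_experts,), 1.0 / num_experts)
        self.momentum = momentum

    def update(self, expert_indices: torch.Tensor) -> torch.Tensor:
        # expert_indices: [num_tokens, k] experts chosen at this step.
        counts = torch.bincount(expert_indices.flatten(),
                                minlength=self.usage.numel()).float()
        frac = counts / counts.sum().clamp(min=1)
        self.usage = self.momentum * self.usage + (1 - self.momentum) * frac
        # Penalty grows as long-run usage drifts away from uniform.
        uniform = 1.0 / self.usage.numel()
        return ((self.usage - uniform) ** 2).sum()
```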

The expert pool size assigned to each document is randomly sampled during training rather than fixed, which lets the model support different expert subset sizes at inference time.
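
Conceptually, that amounts to drawing a pool size per document before selecting its experts. The candidate sizes in the fragment below are placeholders, since the article does not list the values actually used.

```python
import random
import torch

# Placeholder candidate sizes; the article says pool sizes are sampled during
# training but does not state the actual values.
CANDIDATE_POOL_SIZES = [8, 16, 32, 64, 128]

def sample_document_pool(doc_router_scores: torch.Tensor) -> torch.Tensor:
    """Draw a random pool size, then keep that many top-scoring experts."""
    pool_size = random.choice(CANDIDATE_POOL_SIZES)
    return doc_router_scores.topk(pool_size).indices
```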

Benchmark performance

On general-purpose benchmarks, EMO matches the performance of a standard MoE model with equivalent architecture trained on the same data, according to AI2. The advantage appears when using expert subsets: a standard MoE with the same architecture degrades sharply when limited to a small subset of experts, while EMO remains robust.

Task-specific expert subsets are constructed by ranking experts based on routing usage on small validation datasets, then keeping only the most-used experts.
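
In code, that procedure reduces to a usage tally followed by a top-k selection. The sketch below is an illustrative rendering of the described recipe, with hypothetical names and a placeholder tally standing in for real validation-set counts.

```python
import torch

def build_task_subset(routing_counts: torch.Tensor,
                      subset_size: int = 16) -> torch.Tensor:
    """Keep the experts the router used most often on the task's validation set.

    routing_counts: [num_experts] tally of routing decisions on task data.
    """
    return routing_counts.topk(subset_size).indices

# Placeholder tally standing in for counts accumulated over a validation set.
counts = torch.randint(0, 100, (128,)).float()
task_experts = build_task_subset(counts, subset_size=16)  # 16 of 128 experts
```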

What this means

EMO demonstrates that mixture-of-experts models can be trained to support modular deployment without sacrificing general-purpose performance. The ability to use 12.5% of experts while maintaining near full-model performance addresses a key limitation in current MoE architectures, where all experts typically need to be loaded even for narrow tasks.

The approach avoids requiring predefined domain labels across the pretraining corpus, instead letting domain specialization emerge from document-level routing patterns. This could enable more flexible deployment options for large sparse models, particularly for users who need specific capabilities without the computational cost of hosting the full parameter set.

Code, models, and the technical paper are available through AI2's GitHub and Hugging Face collections.

Related Articles

model release

Zyphra Releases ZAYA1-8B: 8.4B Parameter MoE Model with 760M Active Parameters Matches 80B+ Models on Math Benchmarks

Zyphra has released ZAYA1-8B, a mixture-of-experts language model with 760M active parameters and 8.4B total parameters. The model scores 89.1% on AIME 2026, competitive with models exceeding 100B parameters, while maintaining efficiency for on-device deployment.

model release

Tencent Releases Hy3 Preview: Mixture-of-Experts Model with 262K Context and Configurable Reasoning

Tencent has released Hy3 preview, a Mixture-of-Experts model with a 262,144 token context window priced at $0.066 per million input tokens and $0.26 per million output tokens. The model features three configurable reasoning modes—disabled, low, and high—designed for agentic workflows and production environments.

model release

InclusionAI Releases Ring-2.6-1T: 1 Trillion Parameter Thinking Model with 63B Active Parameters

InclusionAI has released Ring-2.6-1T, a 1 trillion parameter-scale model with 63 billion active parameters and a 262,144-token context window. The model features adaptive reasoning modes and is designed for coding agents, tool use, and long-horizon task execution.

model release

Google DeepMind Releases Gemma 4 26B A4B Assistant Model for 2x Faster Inference via Multi-Token Prediction

Google DeepMind has released a Multi-Token Prediction assistant model for Gemma 4 26B A4B that achieves up to 2x decoding speedup through speculative decoding. The model uses 3.8B active parameters from a 25.2B total parameter MoE architecture with 128 experts and a 256K token context window.
