Allen Institute releases EMO, a 14B-parameter MoE model that can run on 12.5% of its experts

TL;DR

Allen Institute for AI released EMO, a mixture-of-experts model with 14B total parameters and 1B active parameters, trained on 1 trillion tokens. The model routes each token to 8 active experts from a pool of 128, and can stay close to full-model performance on specific tasks while using just 12.5% of its experts.

Allen Institute for AI (AI2) released EMO, a mixture-of-experts model that can maintain near full-model performance while using just 12.5% of its total experts for specific tasks. The model contains 14 billion total parameters with 1 billion active per forward pass, trained on 1 trillion tokens.

Model architecture and capabilities

EMO routes each token to 8 active experts selected from a pool of 128. According to AI2, when the model is restricted to a 16-expert subset (12.5% of the pool), benchmark performance drops by only 3 percentage points in absolute terms. With 32 experts (25% of the pool), the drop is approximately 1 point.

The model differs from standard MoE architectures in how experts specialize. Rather than organizing around low-level lexical patterns like prepositions or punctuation, EMO's experts form coherent groups around higher-level domains and capabilities.

Training methodology

EMO uses document boundaries as a supervisory signal during training. All tokens within a single document are constrained to route through the same subset of experts, rather than allowing each token to independently select experts. This document-level routing encourages groups of experts to specialize in consistent domains.
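As a rough illustration of document-level routing, the sketch below picks one shared expert pool per document and then lets each token choose its top-k experts only from that pool. The pool-selection rule (highest mean router score over the document), the function names, and the toy scores are assumptions made for illustration; the article does not describe how AI2 selects a document's pool.

```python
from typing import List


def top_indices(scores: List[float], k: int) -> List[int]:
    """Indices of the k largest scores, highest first (ties go to the lower index)."""
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]


def route_document(token_scores: List[List[float]], pool_size: int, k: int) -> List[List[int]]:
    """Route every token of one document inside a single shared expert pool.

    token_scores[t][e] is a hypothetical router score of token t for expert e.
    """
    if not token_scores:
        return []
    if k > pool_size:
        raise ValueError("k cannot exceed pool_size")
    num_experts = len(token_scores[0])
    # Assumption: the document's pool is the set of experts with the highest mean score.
    mean_scores = [
        sum(tok[e] for tok in token_scores) / len(token_scores)
        for e in range(num_experts)
    ]
    pool = top_indices(mean_scores, pool_size)
    routes = []
    for tok in token_scores:
        # Each token still picks its own top-k, but only among pool members.
        best_in_pool = top_indices([tok[e] for e in pool], k)
        routes.append(sorted(pool[i] for i in best_in_pool))
    return routes


# Toy document: 2 tokens, 6 experts, a 3-expert pool, 2 active experts per token.
toy_scores = [
    [0.9, 0.1, 0.8, 0.0, 0.2, 0.7],
    [0.2, 0.9, 0.7, 0.0, 0.1, 0.6],
]
toy_routes = route_document(toy_scores, pool_size=3, k=2)
print(toy_routes)  # [[0, 2], [2, 5]]
```

In the toy run, the second token would prefer expert 1 on its own, but that expert falls outside the document's pool, so the token is routed to experts 2 and 5 instead.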

Load balancing is applied globally, across many documents, rather than locally within each micro-batch. This prevents the model from collapsing onto a small number of experts while still allowing every document to stay within its own consistent set of experts.
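The difference between local and global balancing can be seen in a small sketch. The imbalance metric used here (maximum expert load divided by mean load) is a hypothetical stand-in chosen for readability, not the balancing loss AI2 uses, and the micro-batches are toy data.

```python
from collections import Counter
from typing import Iterable, List


def imbalance(assignments: Iterable[int], num_experts: int) -> float:
    """Max expert load divided by the mean load; 1.0 means perfectly even."""
    counts = Counter(assignments)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no assignments")
    return max(counts.values()) / (total / num_experts)


NUM_EXPERTS = 4
# Each toy micro-batch holds one document whose tokens all stay in a 2-expert pool.
micro_batches: List[List[int]] = [
    [0, 1, 0, 1],
    [2, 3, 2, 3],
]

# Measured per micro-batch, every batch looks skewed toward its document's pool.
local_scores = [imbalance(mb, NUM_EXPERTS) for mb in micro_batches]
# Measured over all batches together, the load is spread evenly.
global_score = imbalance([e for mb in micro_batches for e in mb], NUM_EXPERTS)

print(local_scores)  # [2.0, 2.0]
print(global_score)  # 1.0
```

A balancing term computed per micro-batch would penalize exactly the document-level consistency the training scheme is trying to encourage, whereas the same measure taken across many documents is satisfied as long as different documents use different experts.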

The size of each document's expert pool is randomly sampled during training rather than fixed, which allows the model to support different expert subset sizes at inference time.
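A minimal sketch of sampling a pool size per document might look like the following. The uniform distribution, the lower bound of 8 (the number of active experts per token), and the function name are assumptions; the article does not say which distribution or range AI2 samples from.

```python
import random

TOTAL_EXPERTS = 128   # pool size reported for EMO
ACTIVE_PER_TOKEN = 8  # active experts per token reported for EMO


def sample_pool_size(rng: random.Random,
                     minimum: int = ACTIVE_PER_TOKEN,
                     maximum: int = TOTAL_EXPERTS) -> int:
    """Draw a per-document pool size uniformly from [minimum, maximum].

    The uniform distribution and the bounds are assumptions for illustration.
    """
    if not 1 <= minimum <= maximum:
        raise ValueError("need 1 <= minimum <= maximum")
    return rng.randint(minimum, maximum)


rng = random.Random(0)  # seeded so the sketch is reproducible
sizes = [sample_pool_size(rng) for _ in range(5)]
print(sizes)
```

Because the model sees many pool sizes during training under this kind of scheme, no single subset size is baked in, which is consistent with the 16-expert and 32-expert results AI2 reports.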

Benchmark performance

On general-purpose benchmarks, EMO matches the performance of a standard MoE model with equivalent architecture trained on the same data, according to AI2. The performance advantage appears when using expert subsets: a standard MoE with the same architecture degrades sharply when limited to small expert subsets, while EMO maintains robustness.

Task-specific expert subsets are constructed by ranking experts based on routing usage on small validation datasets, then keeping only the most-used experts.
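The selection step described above can be sketched as a simple usage count. The function name, the input format (one list of selected expert ids per validation token), and the tie-breaking rule are assumptions for illustration rather than AI2's released code.

```python
from collections import Counter
from typing import Iterable, List


def select_expert_subset(routed_experts: Iterable[Iterable[int]], subset_size: int) -> List[int]:
    """Keep the `subset_size` experts the router chose most often.

    routed_experts yields, for each validation token, the expert ids selected for it.
    """
    if subset_size < 1:
        raise ValueError("subset_size must be positive")
    usage = Counter(e for token in routed_experts for e in token)
    # Rank by descending usage, then by id so ties resolve deterministically.
    ranked = sorted(usage, key=lambda e: (-usage[e], e))
    return sorted(ranked[:subset_size])


# Toy validation set: 5 tokens, each routed to 2 experts.
validation_routes = [[3, 7], [3, 5], [7, 3], [1, 3], [5, 7]]
print(select_expert_subset(validation_routes, subset_size=2))  # [3, 7]
```

In this toy data expert 3 is chosen four times and expert 7 three times, so a two-expert subset keeps those two and drops experts 5 and 1.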

What this means

EMO demonstrates that mixture-of-experts models can be trained to support modular deployment without sacrificing general-purpose performance. The ability to use 12.5% of experts while maintaining near full-model performance addresses a key limitation in current MoE architectures, where all experts typically need to be loaded even for narrow tasks.

The approach avoids requiring predefined domain labels across the pretraining corpus, instead letting domain specialization emerge from document-level routing patterns. This could enable more flexible deployment options for large sparse models, particularly for users who need specific capabilities without the computational cost of hosting the full parameter set.

The code, models, and technical paper are available through AI2's GitHub and Hugging Face collections.
