mixture-of-experts
36 articles tagged with mixture-of-experts
Tencent Releases Hy3 Preview: Mixture-of-Experts Model with 262K Context and Configurable Reasoning
Tencent has released Hy3 preview, a Mixture-of-Experts model with a 262,144-token context window, priced at $0.066 per million input tokens and $0.26 per million output tokens. The model features three configurable reasoning modes (disabled, low, and high) designed for agentic workflows and production environments.
Allen Institute releases EMO, 14B parameter MoE model with selective 12.5% expert use
Allen Institute for AI released EMO, a 1B-active, 14B-total-parameter mixture-of-experts model trained on 1 trillion tokens. The model uses 8 active experts per token from a pool of 128 total experts, and can maintain near full-model performance while using just 12.5% of its experts for specific tasks.
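The 8-of-128 routing described above is the standard top-k gating pattern: a small router scores every expert for each token and only the k best actually run. A minimal sketch in PyTorch, assuming a plain softmax router; the layer sizes and class name are illustrative placeholders, not EMO's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy top-k mixture-of-experts layer: each token is routed to k of n_experts FFNs."""
    def __init__(self, d_model=256, d_ff=512, n_experts=128, k=8):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                        # x: (tokens, d_model)
        logits = self.router(x)                  # (tokens, n_experts)
        weights, idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # renormalize over the k selected experts
        out = torch.zeros_like(x)
        for slot in range(self.k):               # exactly k experts run per token
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

x = torch.randn(4, 256)
print(TopKMoE()(x).shape)  # torch.Size([4, 256])
```

With 8 of 128 experts selected, only 6.25% of expert FFNs execute per token, which is where the active-versus-total parameter gap in these announcements comes from.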
Zyphra Releases ZAYA1-8B: 8.4B-Parameter MoE Model With 760M Active Parameters That Matches 80B+ Models on Math Benchmarks
Zyphra has released ZAYA1-8B, a mixture-of-experts language model with 760M active parameters and 8.4B total parameters. The model scores 89.1% on AIME 2026, competitive with models exceeding 100B parameters, while maintaining efficiency for on-device deployment.
Google DeepMind Releases Gemma 4 26B A4B Assistant Model for 2x Faster Inference via Multi-Token Prediction
Google DeepMind has released a Multi-Token Prediction assistant model for Gemma 4 26B A4B that achieves up to 2x decoding speedup through speculative decoding. The model uses 3.8B active parameters from a 25.2B total parameter MoE architecture with 128 experts and a 256K token context window.
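Drafter-based speedups like this come from letting a cheap model propose several tokens and having the large model verify them in a single forward pass; under greedy decoding the accepted output is identical to what the large model would produce on its own. A minimal greedy sketch, where `draft` and `target` are placeholder callables rather than Gemma-specific APIs.

```python
def speculative_decode(target, draft, prompt, n_new=64, k=4):
    """Greedy draft-and-verify loop.

    draft(tokens, n) -> list of n proposed next tokens (cheap model).
    target(tokens)   -> greedy next-token prediction at every position of `tokens`
                        (one forward pass of the big model).
    Both are placeholders: any pair of models sharing a tokenizer works.
    """
    tokens = list(prompt)
    while len(tokens) < len(prompt) + n_new:
        proposal = draft(tokens, k)              # k cheap guesses
        verified = target(tokens + proposal)     # one expensive pass scores them all
        accepted = 0
        for i, tok in enumerate(proposal):
            # target's prediction after the prefix of length len(tokens)+i
            if verified[len(tokens) + i - 1] != tok:
                break
            accepted += 1
        tokens += proposal[:accepted]
        # always take one token from the target so decoding never stalls
        tokens.append(verified[len(tokens) - 1])
    return tokens
```

The acceptance check is what preserves exactness: any drafted token the target disagrees with is discarded along with everything after it, so quality never depends on the drafter.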
Google DeepMind releases Gemma 4 with 31B dense model, 256K context window, and speculative decoding drafters
Google DeepMind has released Gemma 4, a family of open-weight multimodal models including a 31B dense model with 256K context window and four size variants ranging from 2.3B to 30.7B effective parameters. The release includes Multi-Token Prediction (MTP) draft models that achieve up to 2x decoding speedup through speculative decoding while maintaining identical output quality.
Poolside releases Laguna XS.2: 33B parameter MoE coding model with 131K context window
Poolside has released Laguna XS.2, a 33B total parameter Mixture-of-Experts model with 3B activated parameters per token, designed for agentic coding. The model features a 131,072-token context window, scores 68.2% on SWE-bench Verified, and is available under Apache 2.0 license with free API access.
Xiaomi Releases MiMo-V2.5-Pro: 1.02T Parameter MoE Model with 1M Context Window
Xiaomi has released MiMo-V2.5-Pro, an open-source Mixture-of-Experts model with 1.02 trillion total parameters and 42 billion active parameters. The model supports up to 1 million tokens context length and claims 99.6% on GSM8K and 86.2% on MATH benchmarks.
Alibaba Releases Qwen3.6 Max Preview: 1 Trillion Parameter MoE Model With 262K Context Window
Alibaba Cloud has released Qwen3.6 Max Preview, a proprietary frontier model built on sparse mixture-of-experts architecture with approximately 1 trillion total parameters. The model supports a 262,144-token context window and features integrated thinking mode for multi-turn reasoning, priced at $1.30 per million input tokens and $7.80 per million output tokens.
DeepSeek V4 cuts inference costs with 1.6T parameter model using 13.7x less memory than V3
DeepSeek released V4 in two versions: a 284 billion parameter Flash model and a 1.6 trillion parameter Pro model with 49 billion active parameters. According to DeepSeek, the models use 9.5x-13.7x less memory than V3 through compressed attention mechanisms and FP4/FP8 mixed precision, while supporting a 1 million token context window.
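A rough back-of-envelope calculation shows why attention-state compression and low-precision formats dominate the memory story at a 1M-token context. The layer count, head configuration, and compression ratio below are illustrative placeholders, not DeepSeek's published architecture.

```python
def kv_cache_gib(context_len, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Uncompressed KV-cache size: 2 (K and V) * layers * kv_heads * head_dim * tokens."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 2**30

# Illustrative numbers only; not DeepSeek V4's real configuration.
baseline = kv_cache_gib(1_000_000, n_layers=60, n_kv_heads=8, head_dim=128, bytes_per_elem=2)        # FP16
compressed = kv_cache_gib(1_000_000, n_layers=60, n_kv_heads=8, head_dim=128, bytes_per_elem=1) / 4  # FP8 + 4x latent compression

print(f"baseline   ~{baseline:.1f} GiB")    # ~228.9 GiB
print(f"compressed ~{compressed:.1f} GiB")  # ~28.6 GiB (8x smaller in this toy example)
```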
DeepSeek V4 Pro launches with 1.6 trillion parameters, 1M token context at $0.145 per million input tokens
Chinese AI lab DeepSeek has released preview versions of DeepSeek V4 Flash and V4 Pro, mixture-of-experts models with 1 million token context windows. The V4 Pro has 1.6 trillion total parameters (49 billion active), making it the largest open-weight model available, while both models significantly undercut frontier model pricing.
DeepSeek V4 Pro launches with 1.6T parameters at $1.74/M tokens, undercutting Claude Sonnet 4.6 by 42%
DeepSeek released two preview models: V4 Pro (1.6T total parameters, 49B active) and V4 Flash (284B total, 13B active), both with 1 million token context windows. V4 Pro is priced at $1.74/M input tokens and $3.48/M output—42% cheaper than Claude Sonnet 4.6—while V4 Flash at $0.14/$0.28 per million tokens undercuts all small frontier models.
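The quoted 42% discount is easy to sanity-check against input pricing: it implies a Claude Sonnet 4.6 input price near $3.00 per million tokens. A quick comparison for a made-up workload, using only the per-token prices quoted above plus that implied baseline.

```python
# Per-million-token input prices; $3.00 for Sonnet is the baseline implied by the 42% figure.
V4_PRO, V4_FLASH, SONNET_INPUT = 1.74, 0.14, 3.00

print(f"discount vs Sonnet input price: {1 - V4_PRO / SONNET_INPUT:.0%}")  # 42%

# Hypothetical agent workload: 200 requests/day, 40K input tokens each.
daily_input_tokens = 200 * 40_000
for name, price in [("V4 Pro", V4_PRO), ("V4 Flash", V4_FLASH), ("Sonnet 4.6", SONNET_INPUT)]:
    print(f"{name:10s} ${daily_input_tokens / 1e6 * price:6.2f}/day input cost")
```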
DeepSeek Releases V4-Flash: 284B-Parameter MoE Model With 1M Token Context at 27% of V3.2's Inference Cost
DeepSeek released two Mixture-of-Experts models: V4-Flash with 284B total parameters (13B activated) and V4-Pro with 1.6T parameters (49B activated). Both models support one million token context windows and use a hybrid attention architecture that requires only 27% of the inference FLOPs compared to DeepSeek-V3.2 at 1M token context.
Tencent Releases Hy3-Preview: 295B-Parameter MoE Model with 21B Active Parameters
Tencent has released Hy3-preview, a 295-billion-parameter Mixture-of-Experts model with 21 billion active parameters and a 256K context window. The model scores 76.28% on MATH and 34.86% on LiveCodeBench-v6, with particularly strong performance on coding agent tasks.
Arcee AI releases Trinity-Large-Thinking, open reasoning model matching Claude Opus on agent tasks
Arcee AI has released Trinity-Large-Thinking, a 400-billion-parameter open-weight reasoning model with a mixture-of-experts architecture that activates only 13 billion parameters per token. The model matches Claude Opus 4.6 on agent benchmarks like Tau2 and PinchBench but lags on general reasoning tasks. The company spent approximately $20 million—roughly half its total venture capital—to train the model on 2,048 Nvidia B300 GPUs over 33 days.
Arcee AI releases Trinity-Large-Thinking: 398B sparse MoE model with chain-of-thought reasoning
Arcee AI released Trinity-Large-Thinking, a 398B-parameter sparse Mixture-of-Experts model with approximately 13B active parameters per token, post-trained with extended chain-of-thought reasoning for agentic workflows. The model achieves 94.7% on τ²-Bench, 91.9% on PinchBench, and 98.2% on LiveCodeBench, generating explicit reasoning traces in <think>...</think> blocks before producing responses.
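Because the reasoning arrives inside <think>...</think> tags before the visible answer, downstream code usually strips or logs that span separately. A minimal parsing sketch; the tag format comes from the article, while the helper name and regex are ours.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(completion: str) -> tuple[str, str]:
    """Return (reasoning_trace, visible_answer) from a <think>-style completion."""
    match = THINK_RE.search(completion)
    reasoning = match.group(1).strip() if match else ""
    answer = THINK_RE.sub("", completion).strip()
    return reasoning, answer

trace, answer = split_reasoning("<think>Two tools needed; call search first.</think>Here is the plan...")
print(trace)   # Two tools needed; call search first.
print(answer)  # Here is the plan...
```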
Google DeepMind releases Gemma 4 with four model sizes, up to 256K context, multimodal support
Google DeepMind released Gemma 4, an open-weights multimodal model family in four sizes (2.3B to 31B parameters) with context windows up to 256K tokens. All models support text and image input, with audio native to E2B and E4B variants. The Gemma 4 31B dense model scores 85.2% on MMLU Pro, 89.2% on AIME 2026, and 80.0% on LiveCodeBench—significant improvements over Gemma 3.
Google releases Gemma 4 26B with 256K context and multimodal support, free to use
Google DeepMind has released Gemma 4 26B A4B, a free instruction-tuned Mixture-of-Experts model with 262,144 token context window and multimodal capabilities including text, images, and video input. Despite 25.2B total parameters, only 3.8B activate per token, delivering performance comparable to larger 31B models at reduced compute cost.
Google DeepMind releases Gemma 4 with four models up to 31B parameters, 256K context window
Google DeepMind released Gemma 4, an open-weights multimodal model family in four sizes (E2B, E4B, 26B A4B, 31B) with context windows up to 256K tokens and native reasoning capabilities. The 26B A4B variant uses a Mixture-of-Experts architecture with 3.8B active parameters for efficient inference. All models support text and image input, handle 140+ languages, and are released under Apache 2.0 licensing.
Google DeepMind releases Gemma 4, open multimodal models with 256K context and reasoning
Google DeepMind has released Gemma 4, a family of open-weights multimodal models ranging from 2.3B to 31B parameters with support for text, images, video, and audio. The models feature context windows up to 256K tokens, built-in reasoning modes, and native function calling for agentic workflows.
Google DeepMind releases Gemma 4 open models with up to 256K context and multimodal reasoning
Google DeepMind has released Gemma 4, an open-weights model family in four sizes (2.3B to 31B parameters) with multimodal capabilities handling text, images, video, and audio. The 26B A4B variant uses mixture-of-experts to achieve 4B active parameters while supporting 256K token context windows and native reasoning modes.
Google DeepMind releases Gemma 4 family with 256K context window and multimodal capabilities
Google DeepMind released the Gemma 4 family of open-weights models in four sizes (2.3B to 31B parameters) with multimodal support for text, images, video, and audio. The flagship 31B model achieves 85.2% on MMLU Pro and 89.2% on AIME 2024, with context windows up to 256K tokens. All models feature configurable reasoning modes and are optimized for deployment from mobile devices to servers under Apache 2.0 license.
Google DeepMind releases Gemma 4 with 4 model sizes, 256K context, and multimodal reasoning
Google DeepMind released Gemma 4, a family of open-weights multimodal models in four sizes: E2B (2.3B effective), E4B (4.5B effective), 26B A4B (3.8B active), and 31B (30.7B parameters). All models support text and image input with 128K-256K context windows, while E2B and E4B add native audio capabilities and reasoning modes across 140+ languages.
Google releases Gemma 4 family under Apache 2.0 license with 2B to 31B models
Google has released Gemma 4, a family of four open models ranging from 2B to 31B parameters, now available under the Apache 2.0 license for the first time. The 31B dense model ranks 3rd on the Arena AI Text Leaderboard, while the 26B mixture-of-experts variant ranks 6th, both outperforming significantly larger competitors. All models support multimodal inputs and are available on Hugging Face, Kaggle, and Ollama.
Google DeepMind releases Gemma 4: open models ranking #3 and #6 on Arena AI leaderboard
Google DeepMind released Gemma 4, a family of four open models ranging from 2B to 31B parameters, all licensed under Apache 2.0. The 31B dense model ranks #3 on Arena AI's text leaderboard and the 26B mixture-of-experts variant ranks #6, outperforming significantly larger closed models.
Alibaba releases Qwen 3.6 Plus with 1M context window, free tier now available
Alibaba's Qwen division released Qwen 3.6 Plus on April 2, 2026, offering free access to a model with a 1,000,000 token context window. The model combines linear attention with sparse mixture-of-experts routing and achieves a 78.8 score on SWE-bench Verified for software engineering tasks.
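The linear-attention half of that hybrid replaces quadratic softmax attention with a kernel feature map and running sums, so per-token cost stays flat as the context grows. A toy causal version in the style of Katharopoulos et al. (2020); this is a generic sketch, not Qwen's actual kernel.

```python
import numpy as np

def causal_linear_attention(Q, K, V, eps=1e-6):
    """O(n * d^2) causal attention via phi(x) = elu(x) + 1 feature maps and running sums."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))   # positive feature map
    Qp, Kp = phi(Q), phi(K)
    S = np.zeros((K.shape[1], V.shape[1]))   # running sum of k_t v_t^T
    z = np.zeros(K.shape[1])                 # running sum of k_t (normalizer)
    out = np.zeros_like(V)
    for t in range(Q.shape[0]):
        S += np.outer(Kp[t], V[t])
        z += Kp[t]
        out[t] = (Qp[t] @ S) / (Qp[t] @ z + eps)
    return out

q = k = v = np.random.randn(16, 8)             # 16 tokens, dim 8
print(causal_linear_attention(q, k, v).shape)  # (16, 8)
```

Unlike softmax attention, the state carried forward is a fixed-size matrix S and vector z, which is what makes million-token contexts tractable when this style of attention is mixed with sparse MoE layers.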
Holo3 achieves 78.85% on OSWorld benchmark with only 10B active parameters
H Company unveiled Holo3, a computer-use model that scores 78.85% on OSWorld-Verified, the highest result to date on the leading desktop automation benchmark. The model achieves this with only 10B active parameters (122B total), positioning it as a lower-cost alternative to proprietary models like GPT 5.4 and Opus 4.6.
NVIDIA releases gpt-oss-puzzle-88B, 88B-parameter reasoning model with 1.63× throughput gains
NVIDIA released gpt-oss-puzzle-88B on March 26, 2026, an 88-billion-parameter mixture-of-experts model optimized for inference efficiency on H100 hardware. Built using the Puzzle post-training neural architecture search framework, the model achieves a 1.63× throughput improvement in long-context (64K/64K) scenarios and up to 2.82× on single H100 GPUs compared to its parent gpt-oss-120B, while matching or exceeding its accuracy across reasoning effort levels.
Rakuten releases RakutenAI-3.0, 671B-parameter Japanese-optimized mixture-of-experts model
Rakuten Group has released RakutenAI-3.0, a 671 billion parameter mixture-of-experts (MoE) model designed specifically for Japanese language tasks. The model activates 37 billion parameters per token and supports a 128K context window. It is available under the Apache License 2.0 on Hugging Face.
Nvidia releases Nemotron 3 Super: 120B MoE model with 1M token context
Nvidia has released Nemotron 3 Super, a 120-billion parameter hybrid Mamba-Transformer Mixture-of-Experts model that activates only 12 billion parameters during inference. The open-weight model features a 1-million token context window, multi-token prediction capabilities, and pricing at $0.10 per million input tokens and $0.50 per million output tokens.
NVIDIA Nemotron 3 Super now available on Amazon Bedrock with 256K context window
NVIDIA Nemotron 3 Super, a hybrid Mixture-of-Experts model with 120B total parameters and 12B active parameters, is now available as a fully managed model on Amazon Bedrock. The model supports up to a 256K-token context length and claims 5x higher throughput efficiency than the previous Nemotron Super and 2x higher accuracy on reasoning tasks.
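Managed Bedrock models are typically called through the converse API. A minimal sketch; the model identifier below is a hypothetical placeholder, since the exact ID Bedrock assigns to Nemotron 3 Super is not given in the article.

```python
import boto3

# Placeholder ID: check the Bedrock console for the actual Nemotron 3 Super identifier.
MODEL_ID = "nvidia.nemotron-3-super"  # hypothetical

client = boto3.client("bedrock-runtime", region_name="us-east-1")
response = client.converse(
    modelId=MODEL_ID,
    messages=[{"role": "user", "content": [{"text": "Summarize mixture-of-experts routing in two sentences."}]}],
    inferenceConfig={"maxTokens": 512, "temperature": 0.2},
)
print(response["output"]["message"]["content"][0]["text"])
```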
Xiaomi launches MiMo-V2-Pro with 1T parameters, matches Claude Opus on coding at 80% lower cost
Xiaomi simultaneously shipped three AI models designed to form a complete agent platform. MiMo-V2-Pro, a 1-trillion-parameter Mixture-of-Experts model with 42 billion active parameters per request, scores 78% on SWE-bench Verified and 81 points on ClawEval, nearly matching Claude Opus 4.6 while costing $1 per million input tokens versus $5 for Opus.
NVIDIA releases Nemotron-3-Super-120B, a 120B parameter model with latent MoE architecture
NVIDIA has released Nemotron-3-Super-120B-A12B-NVFP4, a 120-billion parameter text generation model featuring a latent Mixture-of-Experts (MoE) architecture. The model supports 8 languages including English, French, Spanish, Italian, German, Japanese, and Chinese, and is available on Hugging Face with 8-bit quantization support through NVIDIA's ModelOpt toolkit.
NVIDIA releases Nemotron-3-Super-120B, a 120B parameter model with latent MoE architecture
NVIDIA has released Nemotron-3-Super-120B-A12B-BF16, a 120 billion parameter model designed for text generation and conversational tasks. The model employs a latent mixture-of-experts (MoE) architecture and supports multiple languages including English, French, Spanish, Italian, German, Japanese, and Chinese.
Alibaba releases Qwen3.5-35B-A3B, a 35B multimodal model with Apache 2.0 license
Alibaba has released Qwen3.5-35B-A3B, a 35-billion parameter multimodal model capable of processing images and text. The model is published under an Apache 2.0 license and available on Hugging Face with Transformers and SafeTensors format support.
Liquid AI releases LFM2-24B-A2B, a 24B parameter mixture-of-experts model
Liquid AI has released LFM2-24B-A2B, a 24-billion-parameter mixture-of-experts model designed for text generation and conversational tasks. The model supports nine languages: English, Arabic, Chinese, French, German, Japanese, Korean, Spanish, and Portuguese.
Segmind releases SegMoE, a mixture-of-experts diffusion model for faster image generation
Segmind has released SegMoE, a mixture-of-experts (MoE) diffusion model designed to accelerate image generation while reducing computational overhead. The model applies MoE techniques traditionally used in large language models to the diffusion model architecture, enabling selective expert activation during inference.