JetBrains Releases Mellum2-12B Reasoning Model with 131K Context and Mixture-of-Experts Architecture
JetBrains has released Mellum2-12B-A2.5B-Thinking, a reasoning-augmented assistant model with a 131,072-token context window and a Mixture-of-Experts architecture that has 64 experts and activates 8 of them per token. The model emits explicit chain-of-thought reasoning inside <think>...</think> blocks before providing final answers.
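Downstream code usually needs to separate the reasoning trace from the final answer. The following is a minimal sketch, assuming the raw completion contains at most one <think>...</think> block followed by the answer; the helper name `split_reasoning` and the sample string are hypothetical and not part of any JetBrains tooling.

```python
import re

# Assumed output format: an optional <think>...</think> block, then the answer.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) extracted from a raw completion string."""
    match = THINK_RE.search(text)
    if match is None:
        # No reasoning block found: treat the whole text as the answer.
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the matched block is kept as the answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


raw = "<think>2 + 2 is 4.</think>\nThe answer is 4."  # made-up sample output
reasoning, answer = split_reasoning(raw)
print("reasoning:", reasoning)
print("answer:", answer)
```

When the model is served behind a reasoning parser, the server may already perform this split, in which case client-side parsing like this is unnecessary.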
Architecture and Training
The model uses a Mixture-of-Experts (MoE) architecture with 64 experts, activating 8 experts per token. It features 28 layers with a hidden size of 2,304 and uses grouped-query attention with 32 query heads and 4 key-value heads. The architecture combines sliding-window attention (1,024 tokens) with full attention layers.
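To illustrate what "activating 8 of 64 experts per token" means, here is a minimal sketch of top-k routing for a single token. The expert counts come from the article; the random router logits and the choice to softmax-normalize over the selected experts are assumptions for illustration, since the article does not describe the actual gating function.

```python
import math
import random

NUM_EXPERTS = 64  # from the article
TOP_K = 8         # from the article


def route(logits: list[float], k: int = TOP_K) -> dict[int, float]:
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


random.seed(0)
# Hypothetical router scores for one token; a real router computes these
# from the token's hidden state.
router_logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
weights = route(router_logits)
print(len(weights), "experts active out of", NUM_EXPERTS)
```

Only the selected experts run their feed-forward computation for that token, which is why the active parameter count (2.5B) is much smaller than the total (12B).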
According to JetBrains, the model was produced from Mellum2-12B-A2.5B-Base through supervised fine-tuning (computing loss only on the final assistant turn), followed by reinforcement learning with verifiable rewards (RLVR) on a harder data mix that includes long-form math problems.
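Computing loss only on the final assistant turn is typically implemented by masking the labels of every other token. The sketch below shows that idea under stated assumptions: the token IDs are made up, the helper `mask_labels` is hypothetical, and -100 is used because it is the default `ignore_index` of PyTorch's cross-entropy loss. It is not JetBrains' training code.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy skips by default


def mask_labels(turns: list[tuple[str, list[int]]]) -> tuple[list[int], list[int]]:
    """Concatenate turns into input_ids; labels keep only the last assistant turn."""
    last_assistant = max(
        i for i, (role, _) in enumerate(turns) if role == "assistant"
    )
    input_ids: list[int] = []
    labels: list[int] = []
    for i, (role, ids) in enumerate(turns):
        input_ids.extend(ids)
        if i == last_assistant:
            labels.extend(ids)  # these tokens contribute to the loss
        else:
            labels.extend([IGNORE_INDEX] * len(ids))  # masked out
    return input_ids, labels


# Made-up token IDs for a two-round conversation.
conversation = [
    ("user", [1, 2]),
    ("assistant", [3, 4]),
    ("user", [5]),
    ("assistant", [6, 7, 8]),
]
input_ids, labels = mask_labels(conversation)
print(input_ids)
print(labels)
```

Earlier assistant turns still appear in the input as context, but they produce no gradient.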
Benchmark Performance
On self-reported benchmarks, the Thinking variant scores 69.9% on LiveCodeBench v6, 58.4% on AIME (mean of 2025 and 2026, 30 questions each), and 87.0% on GSM-Plus. On MMLU-Redux, it achieves 86.2% accuracy.
The model scores 45.6% on the Berkeley Function Calling Leaderboard (BFCL) v4, which measures tool-calling capability across five subtasks. On instruction-following and conversational benchmarks, it achieves 76.5% on IFEval and 66.9% on MixEval.
For comparison, JetBrains reports that Qwen3.5-9B scores 73.4% on AIME and 90.7% on GSM-Plus, while Ministral 3 (14B) scores 38.3% on AIME and 86.5% on GSM-Plus.
Technical Details
The model has a vocabulary size of 98,304 tokens and uses bfloat16 precision. It can be served with vLLM using the Qwen3 reasoning parser and supports tool calling with the Hermes parser.
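As a rough sketch of what such a deployment could look like, the snippet below assembles a `vllm serve` command line with the two parsers named above. The repository ID is an assumption (the article does not give one), and flag names can differ between vLLM versions, so check the output against the documentation for the installed release before running it.

```python
import shlex

# Assumed Hugging Face repo ID; the article does not state the exact name.
MODEL_ID = "JetBrains/Mellum2-12B-A2.5B-Thinking"


def vllm_serve_command(model: str = MODEL_ID, max_len: int = 131072) -> list[str]:
    """Build the argv for a vLLM server with reasoning and tool-call parsing."""
    return [
        "vllm", "serve", model,
        "--reasoning-parser", "qwen3",    # separates <think> content from the answer
        "--enable-auto-tool-choice",      # lets the model decide when to call tools
        "--tool-call-parser", "hermes",   # Hermes-style tool-call format
        "--max-model-len", str(max_len),  # full context window from the article
        "--dtype", "bfloat16",
    ]


print(shlex.join(vllm_serve_command()))
```

The function only builds and prints the command; passing the list to `subprocess.run` would start the server on a machine with vLLM installed and enough GPU memory.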
JetBrains has released the model under the Apache 2.0 license. The company also offers a standard "Instruct" variant for direct, low-latency answers without reasoning traces, though pricing has not been disclosed for either version.
Model Family
Mellum2 includes six checkpoints: Base Pretrain, Base (final base model), Instruct SFT, Thinking SFT, Instruct (RL-tuned), and Thinking (RL-tuned). The architecture uses an MoE intermediate size of 896 compared to a standard intermediate size of 7,168 for dense layers.
What This Means
JetBrains' entry into reasoning models puts a developer-tools company directly into competition with Anthropic, OpenAI, and DeepSeek in the chain-of-thought reasoning space. The 131K context window and Apache 2.0 license make it particularly attractive for developers working with large codebases who want self-hosted reasoning capabilities. However, its self-reported scores trail Qwen3.5-9B on AIME and GSM-Plus, suggesting it may be better suited to coding and debugging workflows than to pure math reasoning tasks.