IBM's Granite 4.1: 8B Dense Model Matches 32B MoE Performance on 15T Tokens
IBM released Granite 4.1, a family of dense decoder-only LLMs (3B, 8B, and 30B parameters) trained on approximately 15 trillion tokens using a five-phase pre-training pipeline. The 8B instruct model matches or surpasses the previous Granite 4.0-H-Small (a 32B-A9B MoE) despite using fewer parameters and a simpler dense architecture. All models support context windows of up to 512K tokens and are released under the Apache 2.0 license.
Architecture and Context
All three models use a dense decoder-only transformer architecture with Grouped Query Attention (GQA), Rotary Position Embeddings (RoPE), SwiGLU activations, and RMSNorm. The 3B model uses an embedding size of 2560 with 40 layers, while the 8B and 30B models both use an embedding size of 4096, with 40 and 64 layers respectively. All variants use 8 KV heads for efficient attention.
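For orientation, the reported dimensions can be collected into a small configuration sketch. The class and field names below are illustrative rather than IBM's actual config keys; the values are those reported above.

```python
from dataclasses import dataclass

@dataclass
class GraniteConfig:
    """Hypothetical container for the reported Granite 4.1 dimensions."""
    hidden_size: int        # embedding size
    num_layers: int         # decoder blocks
    num_kv_heads: int = 8   # GQA: all variants share 8 KV heads

# Dimensions as reported for the three dense variants
GRANITE_4_1 = {
    "3b":  GraniteConfig(hidden_size=2560, num_layers=40),
    "8b":  GraniteConfig(hidden_size=4096, num_layers=40),
    "30b": GraniteConfig(hidden_size=4096, num_layers=64),
}
```

Sharing a small, fixed number of KV heads is what makes GQA attractive at long context: the KV cache scales with KV heads rather than query heads, which matters at 512K tokens.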
Context windows extend to 512K tokens through a staged long-context extension process. According to IBM's RULER benchmark results, the 8B base model achieves 83.6% at 32K, 79.1% at 64K, and 73.0% at 128K context lengths. The 30B model scores 85.2%, 84.6%, and 76.7% at the same context lengths.
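The article does not say how the extension is implemented. A common recipe for staged long-context extension in RoPE models is to raise the rotary base frequency at each stage so the positional wavelengths span the new window; the sketch below illustrates that general idea, not IBM's confirmed method, and the per-stage base values are invented.

```python
import torch

def rope_angles(head_dim: int, max_pos: int, base: float = 10_000.0) -> torch.Tensor:
    """Standard RoPE rotation angles; a larger `base` stretches positional
    wavelengths, a common trick when extending the context window."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(max_pos).float()
    return torch.outer(positions, inv_freq)  # shape: (max_pos, head_dim // 2)

# Illustrative staged schedule mirroring 4K -> 32K -> 128K -> 512K;
# the base values are placeholders, not IBM's settings.
for ctx, base in [(4_096, 1e4), (32_768, 1e5), (131_072, 1e6), (524_288, 1e7)]:
    angles = rope_angles(head_dim=128, max_pos=ctx, base=base)
```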
Five-Phase Training Pipeline
IBM trained Granite 4.1 using a five-phase strategy that progressively shifts from broad web data to curated domain-specific content (the reported mixtures are collected into a sketch after the list):
Phase 1 (10T tokens): General pre-training with 59% CommonCrawl, 20% code, 7% math, 10.5% technical documentation, 2% multilingual, and 1.5% domain-specific data.
Phase 2 (2T tokens): The mixture shifts toward math and code, with math rising to 35% (5x Phase 1) and code to 30% (1.5x), alongside 12% high-quality CommonCrawl and 9% synthetic data.
Phase 3 (2T tokens): Mid-training annealing introduces 12.5% long chain-of-thought reasoning trajectories and 12% instruction data (7.5% language, 4.5% code) while balancing CommonCrawl-HQ, math, and code at 16.67% each.
Phase 4 (0.5T tokens): Refinement with 40% CommonCrawl-HQ, 20% code, 20% math, and reduced instruction/reasoning data, while the learning rate decays linearly to zero.
Phase 5: Long-context extension (LCE) staged from 4K to 32K, 128K, and 512K tokens. The 512K extension for 8B and 30B models uses 80% books and 20% code repositories.
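Collected in one place, the reported budgets and mixtures look like this (the dictionary keys are mine; percentages are the article's, and remainders it does not itemize are omitted):

```python
# Token budgets (trillions) and headline data mixtures per phase, as reported.
PHASES = {
    1: {"tokens_T": 10.0, "mix": {"commoncrawl": 0.59, "code": 0.20, "math": 0.07,
                                  "technical_docs": 0.105, "multilingual": 0.02,
                                  "domain_specific": 0.015}},
    2: {"tokens_T": 2.0,  "mix": {"math": 0.35, "code": 0.30,
                                  "commoncrawl_hq": 0.12, "synthetic": 0.09}},
    3: {"tokens_T": 2.0,  "mix": {"long_cot": 0.125, "instruction": 0.12,
                                  "commoncrawl_hq": 0.1667, "math": 0.1667,
                                  "code": 0.1667}},
    4: {"tokens_T": 0.5,  "mix": {"commoncrawl_hq": 0.40, "code": 0.20,
                                  "math": 0.20}},  # lr decayed linearly to zero
    5: {"tokens_T": None, "stages": [4_096, 32_768, 131_072, 524_288],
        "mix_512k": {"books": 0.80, "code_repos": 0.20}},  # 512K: 8B/30B only
}
```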
Data Quality Controls
IBM applied supervised fine-tuning on approximately 4.1 million curated samples using an LLM-as-Judge framework. The system evaluates responses across six weighted dimensions: instruction following, correctness, completeness, conciseness, naturalness, and calibration. Hard-reject rules automatically filter out severe defects, including hallucinations, false premises, and incorrect computations, regardless of score.
The framework uses specialized judge prompts for multi-turn dialogue, RAG-augmented responses, tool-calling interactions, and multilingual conversations. In RAG settings, responses not grounded in retrieved context are flagged as hallucinations. Tool-use outputs are validated against allowed tools and parameter schemas.
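A minimal sketch of how such a gate could combine weighted dimension scores with hard-reject rules; the weights and acceptance threshold below are invented for illustration, since the article names the dimensions and reject rules but not the numbers.

```python
# Hypothetical weights and threshold; only the dimension names and the
# hard-reject behavior come from the article.
WEIGHTS = {"instruction_following": 0.25, "correctness": 0.25,
           "completeness": 0.15, "conciseness": 0.10,
           "naturalness": 0.10, "calibration": 0.15}
HARD_REJECTS = {"hallucination", "false_premise", "incorrect_computation"}

def accept(defects: set[str], scores: dict[str, float],
           threshold: float = 0.8) -> bool:
    """Hard rejects override the score; otherwise gate on the weighted sum."""
    if defects & HARD_REJECTS:
        return False
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS) >= threshold
```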
Reinforcement learning uses on-policy GRPO (Group Relative Policy Optimization) with DAPO loss to strengthen performance in math, coding, instruction following, and general chat.
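For context on the method: GRPO replaces a learned value baseline with a group-relative one, sampling several responses per prompt and normalizing each response's reward against its own group. A minimal sketch of that advantage computation, with DAPO's loss-side modifications (asymmetric clipping ranges, token-level aggregation) omitted:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages from rewards of shape [num_prompts, group_size]:
    each response is baselined against the other samples for the same prompt."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# e.g. 2 prompts x 4 sampled responses each
adv = grpo_advantages(torch.tensor([[1.0, 0.0, 0.5, 1.0],
                                    [0.2, 0.8, 0.8, 0.8]]))
```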
What This Means
By matching a 32B MoE model with an 8B dense architecture, Granite 4.1 demonstrates that careful data curation and multi-stage training can compete with mixture-of-experts approaches. The progressive data mixture strategy, starting broad and narrowing to high-quality domain-specific content, provides a replicable blueprint for training smaller models efficiently.
The Apache 2.0 license removes deployment restrictions, making these models particularly relevant for enterprise use cases where licensing constraints matter. The 512K context window positions Granite 4.1 for long-document processing tasks, though real-world performance at extreme context lengths will depend on specific use cases. The detailed technical documentation, including exact data percentages and training phases, is unusually transparent for an enterprise model release.