IBM's Granite 4.1: 8B Dense Model Matches 32B MoE Performance on 15T Tokens

TL;DR

IBM released Granite 4.1, a family of dense decoder-only LLMs (3B, 8B, 30B parameters) trained on approximately 15 trillion tokens using a five-phase pre-training pipeline. The 8B instruct model matches or surpasses the previous Granite 4.0-H-Small (32B-A9B MoE) despite using fewer parameters and a simpler dense architecture. All models support context windows of up to 512K tokens and are released under the Apache 2.0 license.

Granite 4.1 8B: Quick Specs

Context window: 131K tokens
Input: $0.05/1M tokens
Output: $0.1/1M tokens

IBM released Granite 4.1, a family of dense decoder-only LLMs with 3B, 8B, and 30B parameter variants trained on approximately 15 trillion tokens. The 8B instruct model matches or surpasses the previous Granite 4.0-H-Small (32B-A9B MoE) despite using fewer parameters and a simpler dense architecture.

Architecture and Context

All three models use a decoder-only dense transformer architecture with Grouped Query Attention (GQA), Rotary Position Embeddings (RoPE), SwiGLU activations, and RMSNorm. The 3B model has an embedding size of 2560 and 40 layers, while the 8B and 30B models both use an embedding size of 4096, with 40 and 64 layers respectively. All variants use 8 KV heads, which keeps the attention key-value cache small.
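To make the effect of 8 KV heads concrete, here is a rough back-of-the-envelope sketch of the key-value cache size for one sequence. The layer and KV-head counts come from the paragraph above; the head dimension of 128 and 16-bit cache storage are assumptions chosen only for illustration, since the article does not state them.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for a single sequence."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# ASSUMPTION: head_dim=128 and 2-byte (16-bit) values are not from the article.
HEAD_DIM_ASSUMED = 128
NUM_KV_HEADS = 8                   # from the article
LAYERS = {"8B": 40, "30B": 64}     # from the article

for name, layers in LAYERS.items():
    for seq_len in (32 * 1024, 128 * 1024, 512 * 1024):
        size = kv_cache_bytes(layers, NUM_KV_HEADS, HEAD_DIM_ASSUMED, seq_len)
        print(f"{name} @ {seq_len // 1024}K tokens: {size / 2**30:.1f} GiB")
```

Under these assumptions the cache grows linearly with context length, so a single 512K-token sequence on the 8B model would need about 80 GiB for the cache alone. Real deployments may differ if the head dimension or cache precision is different.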

Context windows extend to 512K tokens through a staged long-context extension process. According to IBM's RULER benchmark results, the 8B base model achieves 83.6% at 32K, 79.1% at 64K, and 73.0% at 128K context lengths. The 30B model scores 85.2%, 84.6%, and 76.7% at the same context lengths.

Five-Phase Training Pipeline

IBM trained Granite 4.1 using a five-phase strategy that progressively shifts from broad web data to curated domain-specific content:

Phase 1 (10T tokens): General pre-training with 59% CommonCrawl, 20% code, 7% math, 10.5% technical documentation, 2% multilingual, and 1.5% domain-specific data.

Phase 2 (2T tokens): The mixture shifts toward math and code. Math rises to 35% (5x its Phase 1 share) and code to 30% (1.5x its Phase 1 share), alongside 12% high-quality CommonCrawl and 9% synthetic data.

Phase 3 (2T tokens): Mid-training annealing introduces 12.5% long chain-of-thought reasoning trajectories and 12% instruction data (7.5% language, 4.5% code) while balancing CommonCrawl-HQ, math, and code at 16.67% each.

Phase 4 (0.5T tokens): Refinement phase with 40% CommonCrawl-HQ, 20% code, 20% math, and a reduced share of instruction and reasoning data, trained with a linear learning-rate decay to zero.

Phase 5: Long-context extension (LCE) staged from 4K to 32K, 128K, and 512K tokens. The 512K extension for 8B and 30B models uses 80% books and 20% code repositories.
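The phase budgets and mixture percentages above translate directly into per-source token counts. The sketch below does that arithmetic using only the figures listed in the article; the dictionary layout and source names are illustrative, and where the article does not itemize a phase fully, the remainder is reported as unlisted rather than guessed.

```python
# Phase budgets (trillions of tokens) and mixture shares (%) as listed in the article.
# Phases 2-4 are only partly itemized there, so their shares need not sum to 100.
PHASES = {
    "phase1": (10.0, {"commoncrawl": 59.0, "code": 20.0, "math": 7.0,
                      "technical_docs": 10.5, "multilingual": 2.0, "domain": 1.5}),
    "phase2": (2.0, {"math": 35.0, "code": 30.0,
                     "commoncrawl_hq": 12.0, "synthetic": 9.0}),
    "phase3": (2.0, {"long_cot": 12.5, "instruction": 12.0,
                     "commoncrawl_hq": 16.67, "math": 16.67, "code": 16.67}),
    "phase4": (0.5, {"commoncrawl_hq": 40.0, "code": 20.0, "math": 20.0}),
}


def tokens_by_source(budget_t, shares_pct):
    """Convert percentage shares into trillions of tokens per source."""
    return {src: budget_t * pct / 100.0 for src, pct in shares_pct.items()}


def total_for(source):
    """Total trillions of tokens a source contributes across phases 1-4."""
    return sum(tokens_by_source(b, s).get(source, 0.0) for b, s in PHASES.values())


def unlisted_share(shares_pct):
    """Percentage of a phase's mixture that the article does not itemize."""
    return 100.0 - sum(shares_pct.values())


print("total budget, phases 1-4 (T):", sum(b for b, _ in PHASES.values()))
for source in ("math", "code"):
    print(f"{source}: {total_for(source):.2f}T tokens")
for name, (_, shares) in PHASES.items():
    print(f"{name}: {unlisted_share(shares):.2f}% unlisted")
```

Phases 1 through 4 add up to 14.5T tokens, which is consistent with the "approximately 15 trillion" headline figure once the long-context phase, whose token count is not given, is included.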

Data Quality Controls

IBM applied supervised fine-tuning on approximately 4.1 million samples curated with an LLM-as-Judge framework. The judge scores each response across six weighted dimensions: instruction following, correctness, completeness, conciseness, naturalness, and calibration. Hard-reject rules automatically filter out samples with severe defects, including hallucinations, false premises, or incorrect computations, regardless of their score.
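A minimal sketch of how weighted scoring and hard-reject rules could combine is shown below. The six dimensions and the three defect types come from the article; the weights, the 0-to-1 score scale, the acceptance threshold, and all names are hypothetical, since the article gives none of them. This is not IBM's implementation.

```python
from dataclasses import dataclass, field

# HYPOTHETICAL weights: the article says the dimensions are weighted but gives no values.
WEIGHTS = {
    "instruction_following": 0.25,
    "correctness": 0.25,
    "completeness": 0.15,
    "conciseness": 0.10,
    "naturalness": 0.10,
    "calibration": 0.15,
}
HARD_REJECT_FLAGS = {"hallucination", "false_premise", "incorrect_computation"}
ACCEPT_THRESHOLD = 0.7  # HYPOTHETICAL cut-off on the weighted score


@dataclass
class Judgement:
    scores: dict                              # dimension name -> score in [0, 1]
    flags: set = field(default_factory=set)   # defects reported by the judge


def weighted_score(judgement):
    """Weighted average of the six dimension scores."""
    return sum(WEIGHTS[dim] * judgement.scores[dim] for dim in WEIGHTS)


def keep_sample(judgement):
    """Hard-reject flags override the score; otherwise apply the threshold."""
    if judgement.flags & HARD_REJECT_FLAGS:
        return False
    return weighted_score(judgement) >= ACCEPT_THRESHOLD


perfect = {dim: 1.0 for dim in WEIGHTS}
print(keep_sample(Judgement(perfect)))                     # high score, no defects
print(keep_sample(Judgement(perfect, {"hallucination"})))  # rejected despite the score
```

The point of checking flags first is that a fluent, well-structured answer with one fabricated fact can still score well on most dimensions, so the score alone would not remove it.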

The framework uses specialized judge prompts for multi-turn dialogue, RAG-augmented responses, tool-calling interactions, and multilingual conversations. In RAG settings, responses not grounded in retrieved context are flagged as hallucinations. Tool-use outputs are validated against allowed tools and parameter schemas.
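Validating a tool call against allowed tools and parameter schemas can be sketched as follows. The tool registry, the `get_weather` tool, and the call format are all hypothetical and chosen only for illustration; the article does not describe the schema format IBM uses.

```python
# HYPOTHETICAL tool registry: tool name -> required and optional parameters with types.
ALLOWED_TOOLS = {
    "get_weather": {
        "required": {"city": str},
        "optional": {"unit": str},
    },
}


def validate_tool_call(call, allowed=ALLOWED_TOOLS):
    """Return a list of problems with a tool call; an empty list means it is valid."""
    name = call.get("name")
    spec = allowed.get(name)
    if spec is None:
        return [f"unknown tool: {name!r}"]

    errors = []
    args = call.get("arguments", {})
    known = {**spec["required"], **spec["optional"]}

    # Every required parameter must be present.
    for param in spec["required"]:
        if param not in args:
            errors.append(f"missing required parameter: {param}")

    # Every supplied parameter must be declared and have the declared type.
    for param, value in args.items():
        if param not in known:
            errors.append(f"unexpected parameter: {param}")
        elif not isinstance(value, known[param]):
            errors.append(f"wrong type for {param}: expected {known[param].__name__}")
    return errors


print(validate_tool_call({"name": "get_weather", "arguments": {"city": "Zurich"}}))
print(validate_tool_call({"name": "get_weather", "arguments": {"unit": 3}}))
print(validate_tool_call({"name": "delete_files", "arguments": {}}))
```

A production pipeline would more likely validate against JSON Schema definitions, but the checks are the same in spirit: the tool must exist, required parameters must be present, and values must match their declared types.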

Reinforcement learning uses on-policy GRPO (Group Relative Policy Optimization) with DAPO loss to strengthen performance in math, coding, instruction following, and general chat.
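The core idea behind GRPO is that each sampled response is scored relative to the other responses for the same prompt, so no separate value model is needed. The sketch below shows only that group-relative advantage step, following the commonly published formulation; it does not implement the DAPO loss itself, and the function names, the use of population standard deviation, and the epsilon value are choices made for this illustration rather than details of IBM's setup.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and spread of its own group.

    `rewards` holds one scalar reward per response sampled for the same prompt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


def has_learning_signal(rewards):
    """A group where every response got the same reward yields all-zero advantages."""
    return len(set(rewards)) > 1


# Four responses to one math prompt, rewarded 1.0 if the final answer is correct.
group = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(group))
print(has_learning_signal(group), has_learning_signal([1.0, 1.0, 1.0, 1.0]))
```

Correct responses receive positive advantages and incorrect ones negative, with magnitudes set by how mixed the group is. Groups where every response is right or every response is wrong contribute no gradient, which is why DAPO-style training, as described in its published recipe, filters such groups out and samples more prompts instead.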

What This Means

By matching a 32B MoE model with an 8B dense architecture, Granite 4.1 suggests that careful data curation and multi-stage training can compete with mixture-of-experts approaches. The progressive data mixture strategy, which starts broad and narrows to high-quality domain-specific content, provides a replicable blueprint for training smaller models efficiently.

The Apache 2.0 license removes deployment restrictions, making these models particularly relevant for enterprise use cases where licensing constraints matter. The 512K context window positions Granite 4.1 for long-document processing tasks, though real-world performance at extreme context lengths will depend on specific use cases. The detailed technical documentation, including exact data percentages and training phases, is unusually transparent for an enterprise model release.
