IBM Releases 97M-Parameter Granite Embedding Model With 60.3 MTEB Score — Highest Retrieval Quality Under 100M Parameters
IBM released two new multilingual embedding models under Apache 2.0: a 97M-parameter compact model scoring 60.3 on MTEB Multilingual Retrieval (highest in its size class) and a 311M full-size model scoring 65.2. Both support 200+ languages with enhanced retrieval for 52 languages, handle 32K-token context (64x increase over predecessors), and include code retrieval across 9 programming languages.
IBM has released Granite Embedding Multilingual R2, a pair of multilingual embedding models that address a persistent gap in the embedding space: the 97M-parameter compact model achieves 60.3 on MTEB Multilingual Retrieval across 18 languages, the highest score for any open multilingual embedding model under 100M parameters. The next-best model in that size class, multilingual-e5-small, scores 50.9, a 9.4-point gap.
Two Models, Both Apache 2.0
- granite-embedding-311m-multilingual-r2: 311M parameters, 768-dimensional embeddings, scores 65.2 on MTEB Multilingual Retrieval (second among open models under 500M parameters). Includes Matryoshka dimension support, so embeddings can be truncated to smaller dimensions with modest quality loss.
- granite-embedding-97m-multilingual-r2: 97M parameters, 384-dimensional embeddings, scores 60.3 on the same benchmark, retaining most of the full-size model's quality at roughly one-third the size.
Both models support 200+ languages with enhanced retrieval quality for 52 specifically tuned languages, handle context lengths up to 32,768 tokens (a 64x increase over their R1 predecessors), and include code retrieval across 9 programming languages: Python, Go, Java, JavaScript, PHP, Ruby, SQL, C, and C++.
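For orientation, here is a minimal retrieval sketch using sentence-transformers. The Hugging Face model ids are assumptions based on the model names above; verify them against the model cards, which may also recommend query prefixes or prompts.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# Assumed Hugging Face id; verify against the model card.
model = SentenceTransformer("ibm-granite/granite-embedding-97m-multilingual-r2")

query = "How do I configure a retry policy?"
docs = [
    "Retries are configured via the max_attempts field in client settings.",
    "Das Modell unterstützt mehr als 200 Sprachen.",        # multilingual text
    "def fetch(url): return requests.get(url, timeout=5)",  # code snippet
]

# Encode and rank by cosine similarity.
q_emb = model.encode(query, convert_to_tensor=True)
d_emb = model.encode(docs, convert_to_tensor=True)
print(util.cos_sim(q_emb, d_emb))

# Matryoshka support (311M model): recent sentence-transformers releases
# can truncate embeddings at load time via truncate_dim.
small = SentenceTransformer(
    "ibm-granite/granite-embedding-311m-multilingual-r2", truncate_dim=256
)
```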
Architecture: ModernBERT Replaces XLM-RoBERTa
The R2 generation is rebuilt from the ground up. IBM replaced the XLM-RoBERTa encoder (512-token context) with ModernBERT, a recent architecture that folds in five years of transformer research: alternating global and local attention reduces computation on long sequences, rotary position embeddings enable the 32K context window without positional interpolation, and Flash Attention 2.0 support speeds up encoding on modern GPUs.
The 311M model uses the Gemma 3 tokenizer (262K tokens). The 97M model uses a pruned GPT-OSS tokenizer (180K tokens) designed to preserve multilingual coverage while reducing embedding table size.
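To exercise the long-context path directly, the encoder can be loaded through transformers with Flash Attention 2 enabled. This is a sketch under several assumptions: the model id, a transformers version with ModernBERT support, a GPU build of flash-attn, and CLS pooling (common for Granite embedding models, but check the model card).

```python
# pip install torch transformers flash-attn  (Flash Attention 2 needs a
# supported NVIDIA GPU; use attn_implementation="sdpa" otherwise)
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "ibm-granite/granite-embedding-311m-multilingual-r2"  # assumed id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to("cuda").eval()

# A document far beyond the old 512-token limit: no chunking is required
# up to 32,768 tokens.
long_doc = " ".join(["retrieval augmented generation"] * 5000)
inputs = tokenizer(long_doc, return_tensors="pt",
                   truncation=True, max_length=32768).to("cuda")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state

# CLS pooling, normalized; confirm the recommended pooling on the model card.
embedding = torch.nn.functional.normalize(hidden[:, 0], dim=-1)
print(embedding.shape)  # (1, 768) for the 311M model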
Training Pipeline
According to IBM, the 311M model underwent a multi-stage training process:
- Knowledge distillation from decoder-model teachers (Granite 3.3 Instruct and Mistral v0.2 Instruct, fine-tuned for embedding tasks)
- Contrastive fine-tuning on multilingual retrieval pairs across 52 languages and code (a sketch of this objective follows below)
- Model merging of checkpoints from different training stages
The 97M model was derived through a novel pruning methodology from the 311M architecture. IBM states it intentionally avoided MS-MARCO training data and datasets with non-commercial licensing restrictions.
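IBM has not published the exact training loss, but contrastive fine-tuning for retrieval embeddings is conventionally an InfoNCE objective with in-batch negatives. A minimal, self-contained PyTorch sketch of that standard formulation (the temperature value is illustrative, not IBM's):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, doc_emb, temperature=0.02):
    """In-batch-negatives contrastive loss: row i of query_emb should match
    row i of doc_emb; every other document in the batch is a negative."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature          # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

# Toy usage with random tensors standing in for encoder outputs.
q = torch.randn(8, 384)  # 384 dimensions, matching the 97M model
d = torch.randn(8, 384)
print(info_nce_loss(q, d))
```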
Performance Gains Over R1
The 97M model gains 12.2 points on MTEB Multilingual Retrieval over its R1 predecessor, moving from 48.1 to 60.3. The 311M model gains 13.0 points over its R1 version, moving from 52.2 to 65.2.
Both models ship with ONNX and OpenVINO weights for CPU-optimized inference and work as drop-in replacements in LangChain, LlamaIndex, Haystack, and Milvus with a single model name change.
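As an illustration of the drop-in claim, a sketch of the swap in a LangChain pipeline, assuming the langchain-huggingface integration package and the model id used above:

```python
# pip install langchain-huggingface sentence-transformers
from langchain_huggingface import HuggingFaceEmbeddings

# The only change from an existing pipeline is the model name.
embeddings = HuggingFaceEmbeddings(
    model_name="ibm-granite/granite-embedding-97m-multilingual-r2",  # assumed id
)

vec = embeddings.embed_query("¿Dónde está la documentación de la API?")
print(len(vec))  # 384 dimensions for the 97M model
```

For CPU deployments, the shipped ONNX weights should be loadable through optimum's ORTModelForFeatureExtraction in a similar way, though pooling and normalization then have to be applied manually.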
What This Means
The 97M model sets a new efficiency benchmark for multilingual embeddings. A 9.4-point MTEB lead over the previous best sub-100M model (multilingual-e5-small) represents a meaningful quality jump in a size class where trade-offs typically force choosing between speed and accuracy. The 32K context window addresses a real limitation — previous 512-token windows forced chunking strategies that degraded retrieval quality on long documents.
The Apache 2.0 license and explicit avoidance of restrictive training data make these commercially deployable without licensing concerns. For framework developers, the models' compatibility as drop-in replacements means adding 200+ language support requires changing a single model identifier.
Both models are available on Hugging Face.