
IBM Releases 97M-Parameter Granite Embedding Model With 60.3 MTEB Score: Highest Retrieval Quality Under 100M Parameters

TL;DR

IBM released two new multilingual embedding models under Apache 2.0: a 97M-parameter compact model scoring 60.3 on MTEB Multilingual Retrieval (highest in its size class) and a 311M full-size model scoring 65.2. Both support 200+ languages with enhanced retrieval for 52 languages, handle 32K-token context (64x increase over predecessors), and include code retrieval across 9 programming languages.

IBM has released Granite Embedding Multilingual R2, a pair of multilingual embedding models that address a persistent gap in the embedding space. The 97M-parameter compact model scores 60.3 on MTEB Multilingual Retrieval across 18 languages, the highest score for any open multilingual embedding model under 100M parameters. The next-best model in that size class, multilingual-e5-small, scores 50.9, a gap of 9.4 points.

Two Models, Both Apache 2.0

granite-embedding-311m-multilingual-r2: 311M parameters, 768-dimensional embeddings, scores 65.2 on MTEB Multilingual Retrieval (second among open models under 500M parameters). Includes Matryoshka dimension support.

granite-embedding-97m-multilingual-r2: 97M parameters, 384-dimensional embeddings, scores 60.3 on the same benchmark. Retains most of the full-size model's quality (60.3 vs. 65.2) at roughly one-third the size.
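Matryoshka support on the 311M model means a prefix of the 768-dimensional vector can stand in as a smaller embedding. The sketch below shows that truncation step in plain Python. The function names and toy vectors are illustrative assumptions, not part of IBM's release, and the article does not say which truncated sizes the model was trained for.

```python
import math


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` values and rescale them to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


def cosine(a, b):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    # Toy 4-dimensional vectors; a real model would return 768 values.
    query = [0.5, 0.5, 0.5, 0.5]
    doc = [0.7, 0.1, 0.7, 0.1]
    for dim in (4, 2):
        q = truncate_and_normalize(query, dim)
        d = truncate_and_normalize(doc, dim)
        print(dim, round(cosine(q, d), 3))
```

Renormalizing after truncation matters because the prefix of a unit vector is generally no longer unit length, which would otherwise skew dot-product scores.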

Both models support 200+ languages with enhanced retrieval quality for 52 specifically tuned languages, handle context lengths up to 32,768 tokens (a 64x increase over their R1 predecessors), and include code retrieval across 9 programming languages: Python, Go, Java, JavaScript, PHP, Ruby, SQL, C, and C++.

Architecture: ModernBERT Replaces XLM-RoBERTa

The R2 generation is a ground-up rebuild. IBM replaced the XLM-RoBERTa encoder (512-token context) with ModernBERT, a recent encoder architecture that integrates five years of transformer research advances: alternating local and global attention layers reduce computation on long sequences, rotary position embeddings enable the 32K context window without positional interpolation, and Flash Attention 2.0 support speeds encoding on modern GPUs.
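The saving from alternating attention can be estimated by counting the query-key pairs each layer has to score. The sketch below does that for a stack in which only some layers attend globally. The layer count, window size, and spacing of global layers are illustrative assumptions, not figures from IBM's announcement.

```python
def global_pairs(n: int) -> int:
    """Query-key pairs scored by full (global) self-attention over n tokens."""
    return n * n


def local_pairs(n: int, half_window: int) -> int:
    """Pairs scored when each token only sees neighbours within half_window."""
    h = min(half_window, n - 1)
    # Sum over offsets d in [-h, h] of the (n - |d|) valid positions.
    return n * (2 * h + 1) - h * (h + 1)


def stack_pairs(n: int, layers: int, global_every: int, half_window: int) -> int:
    """Total pairs for a stack where every `global_every`-th layer is global."""
    total = 0
    for layer in range(layers):
        if layer % global_every == 0:
            total += global_pairs(n)
        else:
            total += local_pairs(n, half_window)
    return total


if __name__ == "__main__":
    n = 32_768  # the context length quoted in the article
    # Assumed values for illustration only: 12 layers, 64 tokens either side.
    full = stack_pairs(n, layers=12, global_every=1, half_window=64)
    mixed = stack_pairs(n, layers=12, global_every=3, half_window=64)
    print(f"all-global pairs:  {full:,}")
    print(f"alternating pairs: {mixed:,}")
    print(f"ratio: {mixed / full:.3f}")
```

Because a local layer's cost grows linearly with sequence length while a global layer's grows quadratically, the gap widens as inputs approach the full context window.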

The 311M model uses the Gemma 3 tokenizer (262K-token vocabulary). The 97M model uses a pruned GPT-OSS tokenizer (180K-token vocabulary) designed to preserve multilingual coverage while reducing embedding table size.

Training Pipeline

According to IBM, the 311M model underwent a multi-stage training process:

  1. Knowledge distillation from Granite 3.3 Instruct and Mistral v0.2 Instruct decoder models, fine-tuned for embeddings
  2. Contrastive fine-tuning on multilingual retrieval pairs across 52 languages and code
  3. Model merging of checkpoints from different training stages
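The article does not specify the loss used in the contrastive step. A common choice for retrieval pairs is an InfoNCE-style loss with in-batch negatives, sketched below in plain Python as an assumption, not as IBM's recipe. The function names and the temperature value are illustrative.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(queries, passages, temperature=0.05):
    """Mean cross-entropy where passages[i] is the positive for queries[i].

    Every other passage in the batch acts as a negative for that query.
    """
    if len(queries) != len(passages) or not queries:
        raise ValueError("need equally sized, non-empty batches")
    losses = []
    for i, q in enumerate(queries):
        logits = [dot(q, p) / temperature for p in passages]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


if __name__ == "__main__":
    queries = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[1.0, 0.0], [0.0, 1.0]]   # each query matches its own passage
    shuffled = [[0.0, 1.0], [1.0, 0.0]]  # positives point at the wrong passage
    print("aligned loss: ", info_nce(queries, aligned))
    print("shuffled loss:", info_nce(queries, shuffled))
```

The loss falls as each query's embedding moves closer to its paired passage than to the other passages in the batch, which is the behavior a retrieval model needs at search time.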

The 97M model was derived through a novel pruning methodology from the 311M architecture. IBM states it intentionally avoided MS-MARCO training data and datasets with non-commercial licensing restrictions.

Performance Gains Over R1

The 97M model shows a +12.2 point gain on MTEB Multilingual Retrieval over its R1 predecessor. The 311M model gains +13.0 points over its R1 version, moving from 52.2 to 65.2.

Both models ship with ONNX and OpenVINO weights for CPU-optimized inference and work as drop-in replacements in LangChain, LlamaIndex, Haystack, and Milvus with a single model name change.

What This Means

The 97M model sets a new efficiency benchmark for multilingual embeddings. A 9.4-point MTEB lead over the previous best sub-100M model (multilingual-e5-small) represents a meaningful quality jump in a size class where trade-offs typically force choosing between speed and accuracy. The 32K context window addresses a real limitation — previous 512-token windows forced chunking strategies that degraded retrieval quality on long documents.
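To make the chunking point concrete, the sketch below splits a tokenized document into windows and compares the 512-token and 32,768-token limits. `chunk_tokens` is a hypothetical helper, not an API from any framework named in this article, and the 10,000-token document length is an arbitrary example.

```python
def chunk_tokens(tokens, window, overlap=0):
    """Split a token list into windows of at most `window` tokens.

    Consecutive windows share `overlap` tokens.
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


if __name__ == "__main__":
    document = list(range(10_000))  # stand-in for real token ids
    for window in (512, 32_768):
        pieces = chunk_tokens(document, window)
        print(f"window={window}: {len(pieces)} chunk(s)")
```

Under these assumptions the document needs 20 separate embeddings at 512 tokens but only one at 32,768, so no passage is cut off from its surrounding context.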

The Apache 2.0 license and explicit avoidance of restrictive training data make these commercially deployable without licensing concerns. For framework developers, the models' compatibility as drop-in replacements means adding 200+ language support requires changing a single model identifier.

Both models are available on Hugging Face.
