IBM Releases Granite Speech 4.1 2B: 2-Billion-Parameter Multilingual Speech Model with Non-Autoregressive Variant
IBM has released Granite Speech 4.1 2B, a 2-billion-parameter speech-language model trained on 174,000 hours of audio for automatic speech recognition and translation across English, French, German, Spanish, Portuguese, and Japanese. The model introduces a dual-head CTC encoder and includes variants for speaker attribution and a novel non-autoregressive architecture for higher throughput.
IBM has released Granite Speech 4.1 2B, a 2-billion-parameter speech-language model designed for multilingual automatic speech recognition (ASR) and bidirectional automatic speech translation (AST). The model supports English, French, German, Spanish, Portuguese, and Japanese, and was trained on 174,000 hours of audio from public corpora and synthetic datasets.
Technical Architecture
The model was built by modality-aligning an intermediate checkpoint of granite-4.0-1b-base to speech. According to IBM, the new naming convention reflects actual parameter count rather than base LLM size. Key architectural improvements over the predecessor include:
- Dual-head CTC encoder with both graphemic and BPE outputs
- Frame importance sampling to focus on informative audio segments
- Punctuation and truecasing across all supported languages, including German noun capitalization
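To illustrate what a CTC head produces, here is textbook CTC greedy decoding over a toy grapheme vocabulary. This is a generic sketch of the decoding rule, not IBM's implementation, and the symbol inventory is invented:

```python
BLANK = 0  # CTC blank symbol

def ctc_greedy_decode(frame_ids, blank=BLANK):
    """Collapse repeated frame-level predictions, then drop blanks.
    A dual-head encoder would apply this rule (or a beam-search variant)
    over both a graphemic vocabulary and a BPE vocabulary."""
    out = []
    prev = None
    for t in frame_ids:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out

# Frame-level argmax over a toy 4-symbol vocabulary:
# 0 = blank, 1 = 'c', 2 = 'a', 3 = 't'
frames = [0, 1, 1, 0, 2, 2, 2, 0, 3, 3]
print(ctc_greedy_decode(frames))  # -> [1, 2, 3], i.e. "cat"
```

A grapheme head and a BPE head differ only in the vocabulary the frames index into; combining the two gives complementary views of the same audio.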
IBM offers two additional variants: granite-speech-4.1-2b-plus adds speaker-attributed ASR and word-level timestamps, while granite-speech-4.1-2b-nar introduces a non-autoregressive architecture designed for higher throughput.
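The throughput argument for the non-autoregressive variant can be shown with a counting stub: an autoregressive decoder needs one forward pass per generated token, while a non-autoregressive decoder emits all positions in a single pass. The stub below is a toy for counting passes, not Granite's architecture:

```python
class CountingModel:
    """Stub that only counts forward passes."""
    def __init__(self):
        self.forward_passes = 0

    def step(self, prefix):            # AR: one pass per generated token
        self.forward_passes += 1
        return len(prefix)             # dummy next token

    def parallel(self, length):        # NAR: all positions in one pass
        self.forward_passes += 1
        return list(range(length))

def decode_ar(model, n_tokens):
    out = []
    for _ in range(n_tokens):
        out.append(model.step(out))
    return out

def decode_nar(model, n_tokens):
    return model.parallel(n_tokens)

ar, nar = CountingModel(), CountingModel()
decode_ar(ar, 50)
decode_nar(nar, 50)
print(ar.forward_passes, nar.forward_passes)  # -> 50 1
```

The trade-off is that real NAR decoders must recover output length and token dependencies by other means, which is why they are typically offered as a separate, throughput-oriented variant rather than a drop-in replacement.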
Benchmark Performance
IBM evaluated the model against other speech-language models under 8 billion parameters. On the Open ASR leaderboard (as of April 2026), the model demonstrates competitive performance across standard benchmarks.
For punctuation accuracy, punctuation error rates (PER) ranged from 3.66 on German (CV-DE) to 25.70 on LibriSpeech-clean. Capitalization F1 scores ranged from 89.71 to 99.50, with the highest score on German, where noun capitalization is required.
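The article does not spell out IBM's exact PER formula. One common formulation, sketched below under that assumption, is a Levenshtein error rate computed over only the punctuation marks of reference versus hypothesis text (returned here as a fraction; the figures above appear to be percentages):

```python
import re

PUNCT = re.compile(r"[.,?!;:]")

def edit_distance(a, b):
    """Levenshtein distance via a rolling DP row."""
    d = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, cb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (ca != cb))
    return d[len(b)]

def punct_error_rate(ref, hyp):
    """Edit distance over punctuation sequences, normalized by reference length."""
    r, h = PUNCT.findall(ref), PUNCT.findall(hyp)
    return edit_distance(r, h) / max(len(r), 1)

ref = "Hello, world. How are you?"
hyp = "Hello world. How are you."
print(round(punct_error_rate(ref, hyp), 2))  # -> 0.67 (missing comma, wrong final mark)
```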
The model's keyword list biasing capability was evaluated using F1 scores of transcribed keywords during ASR tasks, excluding common words. IBM reports improved recognition of names, acronyms, and technical jargon compared to inference without keyword biasing.
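One plausible reading of that evaluation protocol, assuming set-level keyword matching and an illustrative stopword list (both are assumptions, not IBM's published scoring code), is:

```python
# Hypothetical stopword list standing in for "common words"
STOPWORDS = {"the", "a", "of", "and", "to"}

def keyword_f1(keywords, reference, hypothesis, stopwords=STOPWORDS):
    """F1 over biasing keywords: a true positive is a keyword present in
    both the reference transcript and the hypothesis."""
    kws = {k.lower() for k in keywords} - stopwords
    ref = kws & set(reference.lower().split())
    hyp = kws & set(hypothesis.lower().split())
    tp, fp, fn = len(ref & hyp), len(hyp - ref), len(ref - hyp)
    if tp == 0:
        return 0.0
    p, r = tp / (tp + fp), tp / (tp + fn)
    return 2 * p * r / (p + r)

# "ctc" is in the keyword list but mis-transcribed as "c t c";
# "the" is excluded as a common word.
score = keyword_f1(
    ["Granite", "CTC", "the"],
    "granite uses a ctc encoder",
    "granite uses a c t c encoder",
)
print(round(score, 2))  # -> 0.67
```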
Training and Capabilities
The 174,000-hour training dataset included:
- Public corpora for ASR and AST
- Synthetic datasets for Japanese ASR
- Data tailored for keyword-biased ASR and speech translation
Beyond the six primary languages for ASR and AST, IBM also reports support for English-to-Italian and English-to-Mandarin translation.
Integration and Licensing
The model is available under the Apache 2.0 license and is supported natively in transformers>=4.52.1. IBM provides integration examples for both transformers and vLLM, covering online and offline inference modes.
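A minimal usage sketch follows, assuming the repo ID ibm-granite/granite-speech-4.1-2b, an <|audio|> prompt placeholder, and the AutoProcessor/AutoModelForSpeechSeq2Seq loading pattern used by earlier Granite Speech model cards; verify all three against the actual model card before relying on them:

```python
from typing import Dict, List

MODEL_ID = "ibm-granite/granite-speech-4.1-2b"  # assumed repo name

def build_asr_chat(instruction: str) -> List[Dict[str, str]]:
    """Compose chat messages; the processor substitutes audio features
    where the <|audio|> placeholder appears (placeholder name assumed)."""
    return [
        {"role": "system", "content": "You are a speech transcription assistant."},
        {"role": "user", "content": f"<|audio|>{instruction}"},
    ]

def transcribe(wav_path: str) -> str:
    """End-to-end sketch -- downloads the checkpoint, so run locally only."""
    import torchaudio
    from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL_ID)
    wav, sr = torchaudio.load(wav_path)

    chat = build_asr_chat("Transcribe the speech into written text.")
    prompt = processor.tokenizer.apply_chat_template(
        chat, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(prompt, wav, sampling_rate=sr, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=256)
    # Strip the prompt tokens, keep only the generated transcript
    return processor.batch_decode(
        out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )[0]
```

For vLLM, the same prompt-plus-audio pattern applies through its multimodal input interface; consult IBM's examples for the exact request format.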
What This Means
IBM's release of three model variants—standard, speaker-attributed, and non-autoregressive—addresses different deployment scenarios from accuracy-focused to throughput-optimized applications. The dual-head CTC encoder and frame importance sampling represent architectural refinements aimed at improving multilingual ASR accuracy. The non-autoregressive variant is particularly notable as an alternative to standard autoregressive decoding for speech tasks. At 2 billion parameters, the model targets enterprise applications requiring on-premise deployment with moderate computational resources.
Related Articles
IBM Releases Granite 4.1 30B With 131K Context Window and Enhanced Tool-Calling
IBM released Granite 4.1 30B, a 30-billion parameter instruction-following model with a 131,072 token context window. The model scores 80.16 on MMLU 5-shot and 88.41 on HumanEval pass@1, with enhanced tool-calling capabilities following OpenAI's function definition schema.
IBM Releases Granite 4.1 8B with 131K Context Window at $0.05/M Input Tokens
IBM has released Granite 4.1 8B, an 8-billion-parameter decoder-only language model with a 131,072-token context window. The model supports 12 languages and costs $0.05 per million input tokens and $0.10 per million output tokens, available under the Apache 2.0 license.
IBM releases Granite 4.1-8B with 131K context window and enhanced tool-calling capabilities
IBM has released Granite 4.1-8B, an 8-billion parameter long-context model with a 131,072-token context window. The model achieves 85.37% on HumanEval and 73.84% on MMLU 5-shot, with enhanced tool-calling capabilities reaching 68.27% on BFCL v3. Released under Apache 2.0 license, it supports 12 languages.
IBM releases Apache 2.0 Granite 4.1 LLMs in 3B, 8B, and 30B sizes
IBM has released the Granite 4.1 family of language models under Apache 2.0 license. The models come in 3B, 8B, and 30B parameter sizes. Unsloth has released 21 GGUF quantized variants of the 3B model ranging from 1.2GB to 6.34GB.