
IBM Releases Granite Speech 4.1 2B: 2-Billion-Parameter Multilingual Speech Model with Non-Autoregressive Variant

TL;DR

IBM has released Granite Speech 4.1 2B, a 2-billion-parameter speech-language model trained on 174,000 hours of audio for automatic speech recognition and translation across English, French, German, Spanish, Portuguese, and Japanese. The model introduces a dual-head CTC encoder and includes variants for speaker attribution and a novel non-autoregressive architecture for higher throughput.


IBM has released Granite Speech 4.1 2B, a 2-billion-parameter speech-language model designed for multilingual automatic speech recognition (ASR) and bidirectional automatic speech translation (AST). The model supports English, French, German, Spanish, Portuguese, and Japanese, and was trained on 174,000 hours of audio from public corpora and synthetic datasets.

Technical Architecture

The model was built by modality-aligning an intermediate checkpoint of granite-4.0-1b-base to speech. According to IBM, the new naming convention reflects actual parameter count rather than base LLM size. Key architectural improvements over the predecessor include:

  • Dual-head CTC encoder with both graphemic and BPE outputs
  • Frame importance sampling to focus on informative audio segments
  • Punctuation and truecasing across all supported languages, including German noun capitalization

IBM offers two additional variants: granite-speech-4.1-2b-plus adds speaker-attributed ASR and word-level timestamps, while granite-speech-4.1-2b-nar introduces a non-autoregressive architecture designed for higher throughput.

Benchmark Performance

IBM evaluated the model against other speech-language models under 8 billion parameters. On the Open ASR leaderboard (as of April 2026), the model demonstrates competitive performance across standard benchmarks.

For punctuation accuracy, the model achieved a punctuation error rate (PER) ranging from 3.66 on German (CV-DE) to 25.70 on LibriSpeech-clean. Capitalization F1 scores ranged from 89.71 to 99.50, with the highest score on German where noun capitalization is required.

The model's keyword list biasing capability was evaluated using F1 scores of transcribed keywords during ASR tasks, excluding common words. IBM reports improved recognition of names, acronyms, and technical jargon compared to inference without keyword biasing.
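The evaluation described above can be sketched as a keyword-level F1 computation: score only occurrences of listed keywords, after excluding common words. The keyword list and stopword set below are made up for illustration and are not IBM's evaluation code.

```python
# Keyword-level F1 for biased ASR: compare keyword occurrences in the
# reference vs. the hypothesis, ignoring stopwords and all other words.
from collections import Counter

STOPWORDS = {"the", "a", "of", "and"}  # illustrative common-word filter

def keyword_f1(reference, hypothesis, keywords):
    kws = {k.lower() for k in keywords} - STOPWORDS
    ref = Counter(w for w in reference.lower().split() if w in kws)
    hyp = Counter(w for w in hypothesis.lower().split() if w in kws)
    tp = sum((ref & hyp).values())  # matched keyword occurrences
    if tp == 0:
        return 0.0
    precision = tp / sum(hyp.values())
    recall = tp / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = keyword_f1(
    reference="granite speech handles the acronym asr well",
    hypothesis="granite speech handles the acronym a s r well",
    keywords=["Granite", "ASR", "the"],
)  # ~0.667: "granite" recovered, "asr" missed in the hypothesis
```

Fragmenting an acronym ("a s r") costs recall on that keyword, which is exactly the failure mode keyword biasing is meant to reduce.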

Training and Capabilities

The 174,000-hour training dataset included:

  • Public corpora for ASR and AST
  • Synthetic datasets for Japanese ASR
  • Data tailored for keyword-biased ASR and speech translation

Beyond ASR and AST across the six primary languages, IBM reports additional support for English-to-Italian and English-to-Mandarin translation.

Integration and Licensing

The model is released under the Apache 2.0 license and is supported natively in transformers>=4.52.1. IBM provides integration examples for both transformers and vLLM, covering online and offline inference modes.
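A hedged sketch of transformers inference is below. The Hugging Face model id and the `<|audio|>` prompt marker are assumptions carried over from earlier Granite Speech release cards, not a verified 4.1 recipe; check the official model card before use.

```python
# Sketch of transcription with Granite Speech via transformers.
# MODEL_ID and the audio placeholder token are ASSUMED, based on the
# pattern of prior Granite Speech releases.

MODEL_ID = "ibm-granite/granite-speech-4.1-2b"  # assumed Hugging Face id

def build_prompt(instruction):
    """Prefix the audio placeholder the processor expands into audio
    features (placeholder string is an assumption)."""
    return f"<|audio|>{instruction}"

def transcribe(wav, sampling_rate=16_000):
    # Heavy deps imported lazily so the sketch can be read without them.
    from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL_ID)
    chat = [{"role": "user",
             "content": build_prompt("can you transcribe the speech into written format?")}]
    text = processor.tokenizer.apply_chat_template(
        chat, tokenize=False, add_generation_prompt=True)
    inputs = processor(text, wav, sampling_rate=sampling_rate,
                       return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=200)
    return processor.batch_decode(out, skip_special_tokens=True)[0]
```

For higher-throughput serving, the same prompt construction would apply under vLLM's multimodal interface; the non-autoregressive variant targets that deployment path.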

What This Means

IBM's release of three model variants—standard, speaker-attributed, and non-autoregressive—addresses different deployment scenarios from accuracy-focused to throughput-optimized applications. The dual-head CTC encoder and frame importance sampling represent architectural refinements aimed at improving multilingual ASR accuracy. The non-autoregressive variant is particularly notable as an alternative to standard autoregressive decoding for speech tasks. At 2 billion parameters, the model targets enterprise applications requiring on-premise deployment with moderate computational resources.
