NVIDIA Releases Nemotron 3.5 ASR: 600M-Parameter Streaming Speech Model for 40 Languages
NVIDIA released Nemotron 3.5 ASR, a 600M-parameter speech-to-text model supporting 40 language-locales from a single checkpoint. The model achieves 0.07 seconds to final transcript after speech ends and ranks 2nd in latency among streaming ASR models according to Artificial Analysis benchmarks.
NVIDIA has released Nemotron 3.5 ASR, a 600M-parameter speech-to-text model that transcribes 40 language-locales from a single checkpoint with built-in punctuation and capitalization. The model is available as open weights on Hugging Face.
Performance Benchmarks
According to independent benchmarks from Artificial Analysis, the model's predecessor, Nemotron 3 ASR, ranks 2nd in latency among all streaming ASR models, taking 0.07 seconds from end of speech to final transcript. The model places in the "most attractive quadrant" of the AA-WER Streaming Index vs. Time to Final Transcription leaderboard for combined accuracy-latency performance.
Technical Architecture
The model uses a Cache-Aware FastConformer-RNNT architecture with two main components:
- 24-layer Cache-Aware FastConformer encoder: Processes each audio frame exactly once by caching self-attention and convolution activations from previous frames, eliminating redundant recomputation
- RNNT decoder: Emits text frame-by-frame as audio streams in for live transcription
The architecture addresses a fundamental problem in streaming ASR: most systems re-process overlapping windows of audio repeatedly, burning compute and adding latency. Nemotron 3.5 ASR's caching approach processes audio once without overlap.
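The difference in encoder work can be illustrated with a toy frame count. This is only a back-of-the-envelope sketch, not NVIDIA's implementation: the function names, the window size, and the hop size are hypothetical, and a real cache-aware encoder stores layer activations rather than counting frames.

```python
def overlapping_window_frames(total_frames: int, window: int, hop: int) -> int:
    """Count frames encoded when a sliding window is re-encoded every `hop` frames."""
    if window < hop:
        raise ValueError("window must be at least as large as hop")
    processed = 0
    # Each step ends `hop` frames later and re-encodes up to `window` trailing frames.
    for end in range(hop, total_frames + hop, hop):
        start = max(0, end - window)
        processed += min(end, total_frames) - start
    return processed


def cached_streaming_frames(total_frames: int) -> int:
    """Count frames encoded when earlier activations are cached: each frame once."""
    return total_frames


if __name__ == "__main__":
    # Hypothetical numbers: 100 frames of audio, 20-frame window, 5-frame hop.
    windowed = overlapping_window_frames(total_frames=100, window=20, hop=5)
    cached = cached_streaming_frames(total_frames=100)
    print(f"windowed: {windowed} frames, cached: {cached} frames")
```

With these illustrative settings the windowed approach encodes several times as many frames as the cached one, which is the redundancy the caching design is meant to remove.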
Supported Languages
The single 600M-parameter checkpoint supports English (US/GB), Spanish (US/ES), German, French (FR/CA), Italian, Arabic, Japanese, Korean, Portuguese (BR/PT), Russian, Hindi, Turkish, Vietnamese, Dutch, Ukrainian, Polish, Finnish, Mandarin, Czech, Bulgarian, Slovak, Swedish, Croatian, Romanian, Estonian, Danish, Hungarian, Norwegian Bokmål, Norwegian Nynorsk, Hebrew, Greek, Lithuanian, Latvian, Maltese, Slovenian, and Thai.
Configurable Latency
The model exposes an attention context size parameter that allows developers to choose operating points from 80ms (ultra-low latency) to 1.12 seconds (highest accuracy) using the same checkpoint:
- [56, 0]: 80ms latency for ultra-low latency voice agents
- [56, 1]: 160ms for interactive voice agents
- [56, 3]: 320ms balanced mode for conversational AI
- [56, 6]: 560ms for high accuracy with reasonable latency
- [56, 13]: 1.12s for highest accuracy
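The five operating points listed above fit a simple pattern: latency equals (right context + 1) × 80 ms. The sketch below encodes that pattern and picks the most accurate setting that fits a latency budget. The 80 ms frame duration is inferred from the listed values rather than stated in the article, and `chunk_latency_ms` and `pick_context` are hypothetical helpers, not part of any NVIDIA API.

```python
FRAME_MS = 80  # assumption: frame duration inferred from the listed operating points

# Operating points as listed in the article: [left context, right context].
OPERATING_POINTS = [[56, 0], [56, 1], [56, 3], [56, 6], [56, 13]]


def chunk_latency_ms(att_context_size):
    """Latency implied by a context setting: (right context + 1) frames."""
    _left, right = att_context_size
    return (right + 1) * FRAME_MS


def pick_context(budget_ms):
    """Return the largest-lookahead setting whose latency fits the budget."""
    fitting = [c for c in OPERATING_POINTS if chunk_latency_ms(c) <= budget_ms]
    if not fitting:
        raise ValueError(f"no operating point fits a {budget_ms} ms budget")
    return max(fitting, key=chunk_latency_ms)


if __name__ == "__main__":
    for ctx in OPERATING_POINTS:
        print(ctx, chunk_latency_ms(ctx), "ms")
```

For example, a 400 ms budget selects [56, 3], the 320 ms balanced mode.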
Language Detection and Fine-Tuning
The model operates in two modes: explicit language specification (target_lang=en-US) for best accuracy when the input language is known, or automatic language detection (target_lang=auto) when the language is unknown.
According to NVIDIA, the model can be fine-tuned for specific languages, domains, or accents. The company demonstrated fine-tuning on Greek and Bulgarian to improve performance on mid-resource European languages.
Deployment and Availability
The model ships as open weights on Hugging Face and as a NeMo checkpoint. It runs on-premises without API dependencies or per-call billing. The model requires mono-channel .wav audio input and uses NeMo's standard JSON-lines manifest format.
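The input requirements above can be checked before inference with a few lines of standard-library Python. This is a minimal sketch: the `audio_filepath`, `duration`, and `text` keys are the fields commonly used in NeMo manifests, but whether this model expects additional fields (for example, a language field) is not stated here, and the helper names are hypothetical.

```python
import json
import wave


def manifest_entry(wav_path: str) -> dict:
    """Build one manifest record, rejecting audio that is not mono."""
    with wave.open(wav_path, "rb") as wav:
        if wav.getnchannels() != 1:
            raise ValueError(f"{wav_path} is not mono-channel audio")
        duration = wav.getnframes() / wav.getframerate()
    # "text" is left empty because no reference transcript is known at inference time.
    return {"audio_filepath": wav_path, "duration": round(duration, 3), "text": ""}


def write_manifest(wav_paths, manifest_path: str) -> None:
    """Write one JSON object per line (JSON-lines), one line per audio file."""
    with open(manifest_path, "w", encoding="utf-8") as f:
        for path in wav_paths:
            f.write(json.dumps(manifest_entry(str(path)), ensure_ascii=False) + "\n")
```

Calling `write_manifest(["call_001.wav"], "manifest.json")` with a hypothetical mono recording would produce a one-line manifest; a stereo file raises a `ValueError` so it can be downmixed first.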
Pricing for inference has not been disclosed. The model was trained on a mix of public and proprietary speech data across all supported languages, normalized to punctuated, properly-cased text.
What This Means
Nemotron 3.5 ASR addresses four long-standing pain points of multilingual ASR with a single model: multiple model deployments, streaming-vs-accuracy tradeoffs, separate post-processing pipelines, and language detection requirements. By processing each audio frame once, the cache-aware architecture avoids the redundant computation of the overlapping-window streaming approaches that have dominated the field. For developers building multilingual voice products, the open weights and configurable latency-accuracy tradeoff represent a practical alternative to API-based services, particularly for on-premises or privacy-sensitive deployments.