NVIDIA Releases Nemotron 3.5 ASR: 600M-Parameter Streaming Speech Model for 40 Languages

TL;DR

NVIDIA released Nemotron 3.5 ASR, a 600M-parameter speech-to-text model supporting 40 language-locales from a single checkpoint. The model achieves 0.07 seconds to final transcript after speech ends and ranks 2nd in latency among streaming ASR models according to Artificial Analysis benchmarks.

June 4, 2026 · 1:06 PM · 3 min read

NVIDIA has released Nemotron 3.5 ASR, a 600M-parameter speech-to-text model that transcribes 40 language-locales from a single checkpoint with built-in punctuation and capitalization. The model is available as open weights on Hugging Face.

Performance Benchmarks

According to independent benchmarks from Artificial Analysis, Nemotron 3.5 ASR ranks 2nd in latency among all streaming ASR models, with 0.07 seconds to final transcript after end of speech. The model also places in the "most attractive quadrant" of the AA-WER Streaming Index vs. Time to Final Transcription leaderboard, which plots accuracy against latency.

Technical Architecture

The model uses a Cache-Aware FastConformer-RNNT architecture with two main components:

24-layer Cache-Aware FastConformer encoder: Processes each audio frame exactly once by caching self-attention and convolution activations from previous frames, eliminating redundant recomputation
RNNT decoder: Emits text frame-by-frame as audio streams in for live transcription

The architecture addresses a fundamental problem in streaming ASR: most systems re-process overlapping windows of audio repeatedly, burning compute and adding latency. Nemotron 3.5 ASR's caching approach processes audio once without overlap.

Supported Languages

The single 600M-parameter checkpoint supports English (US/GB), Spanish (US/ES), German, French (FR/CA), Italian, Arabic, Japanese, Korean, Portuguese (BR/PT), Russian, Hindi, Turkish, Vietnamese, Dutch, Ukrainian, Polish, Finnish, Mandarin, Czech, Bulgarian, Slovak, Swedish, Croatian, Romanian, Estonian, Danish, Hungarian, Norwegian Bokmål, Norwegian Nynorsk, Hebrew, Greek, Lithuanian, Latvian, Maltese, Slovenian, and Thai.

Configurable Latency

The model exposes an attention context size parameter that allows developers to choose operating points from 80ms (ultra-low latency) to 1.12 seconds (highest accuracy) using the same checkpoint:

[56, 0]: 80ms latency for ultra-low latency voice agents
[56, 1]: 160ms for interactive voice agents
[56, 3]: 320ms balanced mode for conversational AI
[56, 6]: 560ms for high accuracy with reasonable latency
[56, 13]: 1.12s for highest accuracy
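The listed operating points are consistent with a latency of (right context + 1) × 80ms. The sketch below encodes that relationship and picks the most accurate setting that fits a latency budget. The 80ms frame duration is inferred from the figures above rather than stated by the source, and the helper names are hypothetical.

```python
# Assumption: one encoder frame covers 80 ms, inferred from the listed
# operating points (for example, [56, 0] -> 80 ms and [56, 13] -> 1.12 s).
FRAME_MS = 80

# [left context, right context] pairs listed in the article.
OPERATING_POINTS = [[56, 0], [56, 1], [56, 3], [56, 6], [56, 13]]


def lookahead_latency_ms(att_context_size: list[int]) -> int:
    """Latency implied by a context setting: current frame plus lookahead."""
    _left, right = att_context_size
    return (right + 1) * FRAME_MS


def pick_context(max_latency_ms: int) -> list[int]:
    """Return the listed setting with the most lookahead within the budget."""
    fitting = [c for c in OPERATING_POINTS
               if lookahead_latency_ms(c) <= max_latency_ms]
    if not fitting:
        raise ValueError(f"no operating point fits {max_latency_ms} ms")
    return max(fitting, key=lookahead_latency_ms)


if __name__ == "__main__":
    for ctx in OPERATING_POINTS:
        print(ctx, lookahead_latency_ms(ctx), "ms")
    print("budget 400 ms ->", pick_context(400))
```

Under this assumption, a 400ms budget selects [56, 3], the 320ms balanced mode.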

Language Detection and Fine-Tuning

The model operates in two modes: explicit language specification (target_lang=en-US) for best accuracy when the input language is known, or automatic language detection (target_lang=auto) when the language is unknown.

According to NVIDIA, the model can be fine-tuned for specific languages, domains, or accents. The company demonstrated fine-tuning on Greek and Bulgarian to improve performance on mid-resource European languages.

Deployment and Availability

The model ships as open weights on Hugging Face and as a NeMo checkpoint. It runs on-premises without API dependencies or per-call billing. The model requires mono-channel .wav audio input and uses NeMo's standard JSON-lines manifest format.
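As a rough illustration of the input requirements, the sketch below builds a JSON-lines manifest from .wav files using only the Python standard library and rejects files that are not mono. The audio_filepath, duration, and text keys follow NeMo's usual ASR manifest layout; whether this model expects additional fields (such as a language tag) is not stated in the source, so treat the entry shape as an assumption. The helper names are hypothetical.

```python
import json
import wave


def manifest_entry(wav_path: str, text: str = "") -> dict:
    """Build one manifest record, rejecting audio that is not mono."""
    with wave.open(str(wav_path), "rb") as wav:
        if wav.getnchannels() != 1:
            raise ValueError(f"{wav_path} is not mono-channel audio")
        duration = wav.getnframes() / wav.getframerate()
    return {
        "audio_filepath": str(wav_path),
        "duration": round(duration, 3),  # seconds
        "text": text,                    # may stay empty for inference
    }


def write_manifest(wav_paths: list[str], manifest_path: str) -> int:
    """Write one JSON object per line; return the number of entries."""
    count = 0
    with open(manifest_path, "w", encoding="utf-8") as out:
        for path in wav_paths:
            out.write(json.dumps(manifest_entry(path), ensure_ascii=False))
            out.write("\n")
            count += 1
    return count
```

Calling write_manifest(["clip1.wav", "clip2.wav"], "manifest.json") would produce a two-line file, provided both clips exist and are mono.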

Pricing for inference has not been disclosed. The model was trained on a mix of public and proprietary speech data across all supported languages, normalized to punctuated, properly-cased text.

What This Means

Nemotron 3.5 ASR folds four long-standing multilingual ASR problems into one model: multiple model deployments, streaming-vs-accuracy tradeoffs, separate post-processing pipelines, and language detection requirements. The cache-aware architecture is a genuine technical improvement over the window-based streaming approaches that have dominated the field. For developers building multilingual voice products, the open weights and configurable latency-accuracy tradeoff offer a practical alternative to API-based services, particularly for on-premises or privacy-sensitive deployments.

Source: huggingface.co
