audio
7 articles tagged with audio
NVIDIA Releases Nemotron 3 Nano Omni: 30B-A3B Multimodal Model With 100+ Page Document Support
NVIDIA released Nemotron 3 Nano Omni, a 30B-A3B Mixture-of-Experts model that processes text, images, video, and audio. The model uses a hybrid Mamba-Transformer architecture with 128 experts and achieves 65.8 on OCRBenchV2-En and 72.2 on Video-MME, while delivering up to 9x higher throughput on multimodal tasks compared to alternatives.
Xiaomi releases MiMo-V2.5: 310B parameter omnimodal model with 1M token context window
Xiaomi released MiMo-V2.5, a 310B total parameter sparse mixture-of-experts model that activates 15B parameters per token. The omnimodal model supports text, image, video, and audio understanding with a 1M token context window and was trained on 48T tokens using FP8 mixed precision.
OpenAI Makes Whisper Speech Recognition Available on OpenRouter at $0.006 per Minute
OpenAI's Whisper 1 automatic speech recognition model is now accessible through OpenRouter's API routing service. The model supports transcription and translation across 50+ languages from audio files up to 25 MB, priced at $0.006 per minute of audio.
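With per-minute pricing, estimating a transcription bill before uploading is simple arithmetic. A minimal sketch, assuming the $0.006/minute rate stated above (the helper name and rounding are illustrative, not part of OpenRouter's API):

```python
# Estimate the cost of transcribing audio at OpenRouter's stated Whisper rate.
# The rate comes from the article summary ($0.006 per minute of audio);
# actual billing is determined by the provider, so treat this as a pre-flight
# estimate only.

WHISPER_RATE_PER_MIN = 0.006  # USD per minute of audio


def estimate_cost(duration_seconds: float,
                  rate_per_min: float = WHISPER_RATE_PER_MIN) -> float:
    """Return the estimated USD cost for transcribing one audio clip."""
    return round(duration_seconds / 60 * rate_per_min, 6)


# A 25-minute podcast episode:
print(estimate_cost(25 * 60))  # 0.15
```

Note that the 25 MB file-size cap is separate from this cost calculation: a long recording may need to be split into chunks before upload regardless of price.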
Google DeepMind releases Gemma 4 open models with multimodal capabilities and 256K context window
Google DeepMind released the Gemma 4 family of open-source models with multimodal capabilities (text, image, audio, video) and context windows up to 256K tokens. Four distinct model sizes—E2B (2.3B effective parameters), E4B (4.5B effective), 26B A4B (3.8B active), and 31B—are available under the Apache 2.0 license, with instruction-tuned and pre-trained variants.
Google releases Gemma 4 family with 31B model, 256K context, multimodal capabilities
Google DeepMind released the Gemma 4 family of open-weights models ranging from 2.3B to 31B parameters, featuring up to 256K token context windows and native support for text, image, video, and audio inputs. The flagship 31B model scores 85.2% on MMLU Pro and 89.2% on AIME 2026, with a smaller 26B MoE variant requiring only 3.8B active parameters for faster inference.
Alibaba's Qwen3.5-Omni learns to write code from speech and video without explicit training
Alibaba has released Qwen3.5-Omni, an omnimodal model handling text, images, audio, and video with a 256,000-token context window. The model reportedly outperforms Google's Gemini 3.1 Pro on audio tasks and supports 74 languages in speech recognition, a 6x increase over its predecessor. An unexpected emergent capability: the model writes working code from spoken instructions and video input, a behavior the team did not explicitly train for.
Google releases Lyria 3 Pro Preview for full-length music generation
Google has released Lyria 3 Pro Preview, a music generation model capable of producing full-length songs with verses, choruses, bridges, vocals, and timed lyrics from text prompts or images. The model features a 1,048,576-token context window and is priced at $0.08 per generated song through the Gemini API.