Alibaba's Qwen3.5-Omni learns to write code from speech and video without explicit training
Alibaba has released Qwen3.5-Omni, an omnimodal model handling text, images, audio, and video with a 256,000-token context window. The model reportedly outperforms Google's Gemini 3.1 Pro on audio tasks and supports 74 languages in speech recognition, up from 11 in its predecessor (a more than sixfold increase). An unexpected emergent capability: writing working code from spoken instructions and video input, which the team says it did not explicitly train for.
Alibaba has released Qwen3.5-Omni, an omnimodal AI model that processes text, images, audio, and video simultaneously. The model demonstrates an unexpected emergent capability: writing functional code from spoken instructions and video input, a skill the Qwen team says it did not explicitly train the model to perform.
Audio Performance Beats Gemini 3.1 Pro
The Qwen team claims Qwen3.5-Omni-Plus achieves state-of-the-art results across 215 audio and audiovisual tasks. Specific benchmark results:
- Audio Comprehension (MMAU): 82.2 vs. Gemini 3.1 Pro's 81.1
- Music Comprehension (RUL-MuchoMusic): 72.4 vs. 59.6
- Dialog (VoiceBench): 93.1 vs. 88.9
- Speech Recognition (Fleurs top 60 languages): 6.55 word error rate vs. Gemini 3.1 Pro's 7.32 (lower is better)
- Cantonese Recognition: 1.95 word error rate vs. Gemini 3.1 Pro's 13.40 (lower is better)
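To put the two word error rate gaps in proportion, the short Python sketch below computes the relative reduction from the figures in the list above. The variable and function names are illustrative, and the calculation assumes both systems were scored on the same test sets, which the article does not detail.

```python
# Word error rates (WER) copied from the list above; lower is better.
# Each tuple is (Qwen3.5-Omni-Plus, Gemini 3.1 Pro).
wer = {
    "Fleurs (top 60 languages)": (6.55, 7.32),
    "Cantonese": (1.95, 13.40),
}


def relative_reduction(ours: float, baseline: float) -> float:
    """Return the percentage by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline * 100.0


# Round to one decimal place for readability.
reductions = {
    name: round(relative_reduction(qwen, gemini), 1)
    for name, (qwen, gemini) in wer.items()
}

for name, pct in reductions.items():
    print(f"{name}: {pct}% lower WER than Gemini 3.1 Pro")
```

On these numbers the Fleurs gap is a relative reduction of roughly 10.5 percent, while the Cantonese gap is about 85.4 percent, which is why the Cantonese result stands out.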
Speech generation performance on the "seed-hard" test set shows a word error rate of 6.24, outperforming GPT-Audio (8.19), Minimax (8.62), and ElevenLabs (27.70). For multilingual voice cloning across 20 languages, the model achieves a word error rate of 1.87 and cosine similarity of 0.79.
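For readers unfamiliar with the two metrics, here is a minimal Python sketch of how they are conventionally computed: word error rate as word-level edit distance divided by reference length, and cosine similarity between two vectors. It assumes the reported error rates are percentages and that the similarity score compares speaker embeddings of the original and cloned voice; the article states neither, and the Qwen team's actual evaluation pipeline is not described.

```python
import math
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref) * 100.0


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


# One substituted word out of four reference words.
print(word_error_rate("turn the volume down", "turn the volume up"))
```

For speech generation, the hypothesis text would typically come from transcribing the synthesized audio with a separate recognizer, so the score reflects how intelligible the generated speech is.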
Massive Language Expansion
Speech recognition coverage expanded dramatically from 11 languages to 74 languages plus 39 Chinese dialects (113 total). Voice output supports 36 languages and dialects with 55 available voices including dialectal and multilingual options. The context window increased from 32,000 to 256,000 tokens, enabling processing of more than 10 hours of audio and over 400 seconds of 720p video at one frame per second.
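The figures in this section imply some rough per-unit token budgets. The back-of-the-envelope Python sketch below derives them under the assumption that each modality alone fills the whole 256,000-token window; the model's actual tokenization rates are not disclosed, so these are upper bounds, not specifications.

```python
# All inputs are taken from the paragraph above.
CONTEXT_TOKENS = 256_000
PREVIOUS_CONTEXT_TOKENS = 32_000
AUDIO_SECONDS = 10 * 60 * 60   # "more than 10 hours" of audio
VIDEO_FRAMES = 400 * 1         # "over 400 seconds" of video at 1 frame per second

# Upper bounds: the window holds at least this much media, so each
# second of audio or frame of video can cost at most this many tokens.
max_tokens_per_audio_second = CONTEXT_TOKENS / AUDIO_SECONDS
max_tokens_per_video_frame = CONTEXT_TOKENS / VIDEO_FRAMES

context_growth = CONTEXT_TOKENS / PREVIOUS_CONTEXT_TOKENS
language_growth = 74 / 11          # recognized languages, new vs. old
total_asr_varieties = 74 + 39      # languages plus Chinese dialects

print(f"audio: at most {max_tokens_per_audio_second:.1f} tokens per second")
print(f"video: at most {max_tokens_per_video_frame:.0f} tokens per frame")
print(f"context window grew {context_growth:.0f}x, languages {language_growth:.1f}x")
```

Under these assumptions audio costs at most about 7.1 tokens per second and a 720p frame at most 640 tokens, which shows how much more expensive video is than audio in this kind of context budget.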
Emergent Code Generation Capability
The model demonstrates an unexpected skill: "audio-visual vibe coding." In demonstrations, Qwen3.5-Omni-Plus builds a functional snake game from verbal description and a video clip. The team claims this capability emerged as a byproduct of native omnimodal scaling and was not explicitly trained.
Beyond code generation, the model automatically segments video content with second-accurate timestamps, provides detailed scene breakdowns identifying speakers, cuts, and sound effects, and can flag sensitive content for moderation purposes.
Architecture Improvements
The model retains a thinker-talker design where the thinker processes multimodal input and generates text, while the talker converts output to contextual speech. Both components now use a hybrid attention-MoE (mixture-of-experts) architecture replacing the pure MoE setup from Qwen3-Omni.
The primary technical upgrade is ARIA (Adaptive Rate Interleave Alignment), which dynamically aligns and interleaves text and voice tokens to address the persistent problem of dropped words, mispronunciations, and garbled numbers in real-time voice output. The predecessor used a rigid 1:1 mapping between text and audio tokens.
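The article does not describe ARIA's internals, so the following Python sketch is only a toy illustration of the general difference between a fixed 1:1 interleave and a variable-rate one. The function names, token labels, and the idea of a known per-word audio chunk are all assumptions made for the example, not Qwen's implementation.

```python
from typing import List, Sequence


def interleave_fixed(text: Sequence[str], audio: Sequence[str]) -> List[str]:
    """Rigid 1:1 schedule: one text token, then one audio token.

    Whatever is left over on the longer side is emitted at the end.
    """
    out: List[str] = []
    for i in range(max(len(text), len(audio))):
        if i < len(text):
            out.append(text[i])
        if i < len(audio):
            out.append(audio[i])
    return out


def interleave_adaptive(
    text: Sequence[str], audio_per_token: Sequence[Sequence[str]]
) -> List[str]:
    """Variable-rate schedule: each text token is followed by all of its audio."""
    if len(text) != len(audio_per_token):
        raise ValueError("need one audio chunk per text token")
    out: List[str] = []
    for token, chunk in zip(text, audio_per_token):
        out.append(token)
        out.extend(chunk)
    return out


# Hypothetical example: a number takes far more audio to speak than to write.
text_tokens = ["2025", "please"]
audio_chunks = [["a0", "a1", "a2", "a3"], ["a4"]]
flat_audio = [a for chunk in audio_chunks for a in chunk]

fixed = interleave_fixed(text_tokens, flat_audio)
adaptive = interleave_adaptive(text_tokens, audio_chunks)
print(fixed)
print(adaptive)
```

In the fixed schedule, most of the audio for "2025" is emitted only after the text has already moved on to "please", so text and speech drift apart. In the adaptive schedule each word stays adjacent to its own audio, which is the kind of alignment the article credits with reducing dropped words and garbled numbers.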
Real-Time Conversation Features
Qwen3.5-Omni adds "semantic interruption" to distinguish user intent from background noise, automatic web search for current information, and function calling support. Users can adjust voice characteristics (volume, tempo, emotion) via voice commands and upload custom voice samples for voice cloning.
API-Only Release, No Open Weights
Unlike previous Qwen releases, Alibaba has not published model weights or announced a license. Qwen3.5-Omni is available only as a hosted service, through the Qwen Chat interface and Alibaba Cloud Model Studio. Pricing information has not been disclosed.
Context and Leadership Changes
The release comes amid significant leadership turbulence. Junyang Lin, Alibaba's chief AI developer and the driving force behind the Qwen series, unexpectedly announced his departure. Key team leads for Qwen coders, post-training, and Qwen 3.5/VL followed. The exits reportedly stemmed from an internal restructuring that would have placed a researcher from Google's Gemini team in charge. CEO Eddie Wu responded by establishing a new "Foundation Model Task Force," reaffirming foundation model development as a core strategic priority.
What This Means
If the reported numbers hold up, Qwen3.5-Omni represents a significant advance in omnimodal AI performance, particularly for audio and speech tasks, where the Qwen team's benchmarks show it ahead of Gemini 3.1 Pro on every comparison listed above. The emergent code-generation capability, unintended but functional, suggests that scaling native multimodal training can produce capabilities beyond what architects explicitly design for. The API-only distribution strategy diverges from Alibaba's open-source positioning and limits external verification of the claimed performance. The leadership exodus raises questions about execution continuity despite management reassurances.