Alibaba's Qwen3.5-Omni learns to write code from speech and video without explicit training
Alibaba has released Qwen3.5-Omni, an omnimodal model that handles text, images, audio, and video with a 256,000-token context window. The model reportedly outperforms Google's Gemini 3.1 Pro on audio tasks and supports speech recognition in 74 languages, roughly a sixfold increase over its predecessor. An unexpected emergent capability: writing working code from spoken instructions and video input, which the team says it did not explicitly train for.
Alibaba's Qwen3.5-Omni Learns Code Generation From Speech and Video Without Explicit Training
Alibaba has released Qwen3.5-Omni, an omnimodal AI model that processes text, images, audio, and video simultaneously. The model demonstrates an unexpected emergent capability: writing functional code from spoken instructions and video input—a skill the Qwen team did not explicitly train it to perform.
Audio Performance Beats Gemini 3.1 Pro
The Qwen team claims Qwen3.5-Omni-Plus achieves state-of-the-art results across 215 audio and audiovisual tasks. Specific benchmark results:
- Audio Comprehension (MMAU): 82.2 vs. Gemini 3.1 Pro's 81.1
- Music Comprehension (RUL-MuchoMusic): 72.4 vs. 59.6
- Dialog (VoiceBench): 93.1 vs. 88.9
- Speech Recognition (Fleurs top 60 languages): 6.55 word error rate vs. Gemini 3.1 Pro's 7.32
- Cantonese Recognition: 1.95 vs. 13.40 word error rate
Speech generation performance on the "seed-hard" test set shows a word error rate of 6.24, outperforming GPT-Audio (8.19), Minimax (8.62), and ElevenLabs (27.70). For multilingual voice cloning across 20 languages, the model achieves a word error rate of 1.87 and cosine similarity of 0.79.
Massive Language Expansion
Speech recognition coverage expanded dramatically from 11 languages to 74 languages plus 39 Chinese dialects (113 total). Voice output supports 36 languages and dialects with 55 available voices including dialectal and multilingual options. The context window increased from 32,000 to 256,000 tokens, enabling processing of more than 10 hours of audio and over 400 seconds of 720p video at one frame per second.
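These figures imply specific per-modality token rates that the announcement does not state directly. The back-of-the-envelope calculation below derives them from the numbers above; the resulting rates are estimates, not published specifications.

```python
# Derived purely from the figures cited above: a 256,000-token context window,
# "more than 10 hours of audio," and "over 400 seconds of 720p video at 1 fps."
# The implied rates are estimates, not numbers published by Alibaba.

CONTEXT_TOKENS = 256_000

audio_seconds = 10 * 60 * 60      # 10 hours of audio
video_frames = 400                # 400 seconds sampled at 1 frame per second

audio_tokens_per_second = CONTEXT_TOKENS / audio_seconds
video_tokens_per_frame = CONTEXT_TOKENS / video_frames

print(f"Implied audio rate: ~{audio_tokens_per_second:.1f} tokens per second of audio")
print(f"Implied video rate: ~{video_tokens_per_frame:.0f} tokens per 720p frame")
# Roughly 7 tokens per second of audio and 640 tokens per frame, assuming
# either modality were allowed to fill the entire context on its own.
```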
Emergent Code Generation Capability
The model demonstrates an unexpected skill: "audio-visual vibe coding." In demonstrations, Qwen3.5-Omni-Plus builds a functional snake game from a verbal description and a video clip. The team claims the capability emerged as a byproduct of native omnimodal scaling rather than explicit training.
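Alibaba has not published integration details for the new model, but earlier Qwen-Omni models were served through Alibaba Cloud Model Studio's OpenAI-compatible endpoint. The sketch below shows how such an audio-visual coding request might look under that assumption; the model identifier, endpoint, and multimodal content fields follow that existing convention and are not confirmed for Qwen3.5-Omni.

```python
# Hypothetical sketch: asking an omnimodal model to write code from a spoken
# instruction plus a reference video clip. Model name, endpoint, and content
# schema are assumptions based on earlier Qwen-Omni API conventions.
import base64
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

with open("instructions.wav", "rb") as f:
    spoken_request = base64.b64encode(f.read()).decode("utf-8")

completion = client.chat.completions.create(
    model="qwen3.5-omni-plus",  # assumed identifier; not yet documented
    messages=[
        {
            "role": "user",
            "content": [
                # Spoken description of the snake game to build
                {"type": "input_audio",
                 "input_audio": {"data": spoken_request, "format": "wav"}},
                # Short clip showing the look and behavior to reproduce
                {"type": "video_url",
                 "video_url": {"url": "https://example.com/snake_reference.mp4"}},
                {"type": "text",
                 "text": "Implement this game as a single HTML file with inline JavaScript."},
            ],
        }
    ],
)

print(completion.choices[0].message.content)
```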
Beyond code generation, the model automatically segments video content with timestamps accurate to the second, provides detailed scene breakdowns that identify speakers, cuts, and sound effects, and can flag sensitive content for moderation.
Architecture Improvements
The model retains a thinker-talker design where the thinker processes multimodal input and generates text, while the talker converts output to contextual speech. Both components now use a hybrid attention-MoE (mixture-of-experts) architecture replacing the pure MoE setup from Qwen3-Omni.
The primary technical upgrade is ARIA (Adaptive Rate Interleave Alignment), which dynamically aligns and interleaves text and audio tokens to address persistent problems with dropped words, mispronunciations, and garbled numbers in real-time voice output. The predecessor used a rigid 1:1 mapping between text and audio tokens.
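Alibaba has not detailed how ARIA works internally. The toy sketch below only illustrates the underlying contrast: a rigid scheme emits exactly one audio token per text token, while an adaptive scheme lets each text token claim as many audio tokens as its spoken duration needs, which is why digit strings and long words are less likely to come out garbled. The durations and interleaving policy here are invented for illustration.

```python
# Toy contrast between fixed 1:1 and adaptive-rate interleaving of text and
# audio tokens. This is NOT Alibaba's ARIA implementation; the duration
# estimates and policy are invented purely to show the difference in shape.

TEXT = ["The", "price", "is", "1,234", "dollars"]

# Pretend per-word spoken durations, measured in audio (codec) tokens.
ESTIMATED_AUDIO_TOKENS = {"The": 1, "price": 2, "is": 1, "1,234": 6, "dollars": 3}


def interleave_fixed(text_tokens):
    """Rigid 1:1 mapping: one audio token per text token, regardless of duration."""
    stream = []
    for word in text_tokens:
        stream.append(("text", word))
        stream.append(("audio", f"{word}#0"))
    return stream


def interleave_adaptive(text_tokens, budget):
    """Adaptive rate: each text token is followed by as many audio tokens as it needs."""
    stream = []
    for word in text_tokens:
        stream.append(("text", word))
        for i in range(budget[word]):
            stream.append(("audio", f"{word}#{i}"))
    return stream


fixed = interleave_fixed(TEXT)
adaptive = interleave_adaptive(TEXT, ESTIMATED_AUDIO_TOKENS)

print(sum(1 for kind, _ in fixed if kind == "audio"), "audio tokens (fixed 1:1)")
print(sum(1 for kind, _ in adaptive if kind == "audio"), "audio tokens (adaptive)")
# The adaptive stream gives "1,234" six audio tokens instead of one, the kind
# of mismatch a rigid mapping turns into truncated or garbled numbers.
```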
Real-Time Conversation Features
Qwen3.5-Omni adds "semantic interruption" to distinguish user intent from background noise, automatic web search for current information, and function calling support. Users can adjust voice characteristics (volume, tempo, emotion) via voice commands and upload custom voice samples for voice cloning.
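The announcement does not specify the function-calling format. The sketch below assumes the standard OpenAI-style tools schema that Alibaba Cloud Model Studio exposes for earlier Qwen models; the model identifier and the web_search tool are hypothetical.

```python
# Hypothetical function-calling request against an OpenAI-compatible endpoint.
# The tools schema is the standard OpenAI format; the model name and the
# web_search tool are assumptions, not documented Qwen3.5-Omni parameters.
import json
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "web_search",
            "description": "Search the web for current information.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="qwen3.5-omni-plus",  # assumed identifier
    messages=[{"role": "user", "content": "What is the weather in Hangzhou right now?"}],
    tools=tools,
)

message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    print("Tool requested:", call.function.name, json.loads(call.function.arguments))
else:
    print(message.content)
```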
API-Only Release, No Open Weights
Unlike previous Qwen releases, Alibaba has not published model weights or announced a license. Qwen3.5-Omni is accessible only as a hosted service, through the Qwen Chat interface and the API on Alibaba Cloud Model Studio. Pricing information has not been disclosed.
Context and Leadership Changes
The release comes amid significant leadership turbulence. Junyang Lin, Alibaba's chief AI developer and the driving force behind the Qwen series, announced a surprise departure, and key team leads for the Qwen coding models, post-training, and Qwen3.5/VL followed. The exits reportedly stemmed from an internal restructuring that would have placed a researcher from Google's Gemini team in charge. CEO Eddie Wu responded by establishing a new "Foundation Model Task Force" and reaffirming foundation model development as a core strategic priority.
What This Means
Qwen3.5-Omni represents a significant advance in omnimodal AI performance, particularly for audio and speech tasks, where its reported results exceed Gemini 3.1 Pro's across multiple benchmarks. The emergent code-generation capability, unintentional but functional, suggests that scaling native omnimodal training produces skills beyond what architects explicitly design for. The API-only distribution diverges from Alibaba's open-source positioning and limits external verification of the claimed performance, and the leadership exodus raises questions about execution continuity despite management reassurances.
Related Articles
Alibaba releases Qwen 3.6 Plus Preview with 1M token context, free via OpenRouter
Alibaba's Qwen division has released Qwen 3.6 Plus Preview, a free multimodal model available via OpenRouter with a 1,000,000 token context window. The model claims stronger reasoning and more reliable agentic behavior compared to the 3.5 series, with particular strength in coding and complex problem-solving tasks.
IBM releases Granite 4.0 3B Vision, compact multimodal model for enterprise document understanding
IBM announced Granite 4.0 3B Vision, a 3 billion parameter vision-language model designed for enterprise document processing. The model achieves 86.4% on Chart2Summary and a 92.1% TEDS score on cropped table extraction, and ships as a LoRA adapter on Granite 4.0 Micro to enable a modular text-only fallback.
Google releases Lyria 3 Pro Preview for full-length music generation
Google has released Lyria 3 Pro Preview, a music generation model capable of producing full-length songs with verses, choruses, bridges, vocals, and timed lyrics from text prompts or images. The model features a 1,048,576 token context window and charges $0.08 per generated song through the Gemini API.
Cohere releases 2B open-source speech model with 5.42% word error rate
Cohere has released Transcribe, a 2 billion parameter open-source automatic speech recognition model that the company claims tops the Hugging Face Open ASR Leaderboard with a 5.42% word error rate. The model supports 14 languages and is available under Apache 2.0 license, outperforming OpenAI's Whisper Large v3 and competing models on both accuracy and throughput metrics.