model release

Alibaba's Qwen3.5-Omni learns to write code from speech and video without explicit training

TL;DR

Alibaba has released Qwen3.5-Omni, an omnimodal model handling text, images, audio, and video with a 256,000-token context window. The model reportedly outperforms Google's Gemini 3.1 Pro on audio tasks and supports speech recognition in 74 languages, up from 11 in its predecessor. An unexpected emergent capability: writing working code from spoken instructions and video input, a skill the team did not explicitly train for.

3 min read

Alibaba's Qwen3.5-Omni Learns Code Generation From Speech and Video Without Explicit Training

Alibaba has released Qwen3.5-Omni, an omnimodal AI model that processes text, images, audio, and video simultaneously. The model demonstrates an unexpected emergent capability: writing functional code from spoken instructions and video input—a skill the Qwen team did not explicitly train it to perform.

Audio Performance Beats Gemini 3.1 Pro

The Qwen team claims Qwen3.5-Omni-Plus achieves state-of-the-art results across 215 audio and audiovisual tasks. Specific benchmark results:

  • Audio Comprehension (MMAU): 82.2 vs. Gemini 3.1 Pro's 81.1
  • Music Comprehension (RUL-MuchoMusic): 72.4 vs. 59.6
  • Dialog (VoiceBench): 93.1 vs. 88.9
  • Speech Recognition (Fleurs, top 60 languages): 6.55 word error rate vs. Gemini 3.1 Pro's 7.32 (lower is better)
  • Cantonese Recognition: 1.95 vs. 13.40 word error rate

Speech generation performance on the "seed-hard" test set shows a word error rate of 6.24, outperforming GPT-Audio (8.19), MiniMax (8.62), and ElevenLabs (27.70). For multilingual voice cloning across 20 languages, the model achieves a word error rate of 1.87 and a cosine similarity of 0.79.
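
Most of the speech numbers above are word error rates (WER), where lower is better: the word-level edit distance between the model's transcript (or, for speech generation, a transcript of its output) and a reference, divided by the reference length, and usually reported as a percentage. A minimal reference implementation in Python:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference
    length, computed as a word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / max(len(ref), 1)

print(f"{wer('pay twelve fifty', 'pay twelve fifteen') * 100:.2f}%")  # 33.33%
```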

Massive Language Expansion

Speech recognition coverage expanded dramatically from 11 languages to 74 languages plus 39 Chinese dialects (113 total). Voice output supports 36 languages and dialects with 55 available voices including dialectal and multilingual options. The context window increased from 32,000 to 256,000 tokens, enabling processing of more than 10 hours of audio and over 400 seconds of 720p video at one frame per second.
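
Some back-of-envelope arithmetic from those figures (illustrative only; Alibaba has not published per-modality token rates, so these are just the rates the stated limits imply):

```python
CONTEXT_TOKENS = 256_000

# Implied audio rate if ~10 hours fills the window (derived, not official)
audio_seconds = 10 * 3600
tokens_per_audio_second = CONTEXT_TOKENS / audio_seconds  # ~7.1 tokens/s

# Implied video cost if ~400 s at 1 fps fills the window (derived, not official)
video_frames = 400 * 1  # one frame per second
tokens_per_frame = CONTEXT_TOKENS / video_frames  # ~640 tokens per 720p frame

print(f"{tokens_per_audio_second:.1f} tokens per second of audio")
print(f"{tokens_per_frame:.0f} tokens per video frame")
```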

Emergent Code Generation Capability

The model demonstrates an unexpected skill: "audio-visual vibe coding." In demonstrations, Qwen3.5-Omni-Plus builds a functional snake game from a verbal description and a video clip. The team claims this capability emerged as a byproduct of native omnimodal scaling rather than explicit training.

Beyond code generation, the model automatically segments video content with second-accurate timestamps, provides detailed scene breakdowns identifying speakers, cuts, and sound effects, and can flag sensitive content for moderation purposes.
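
The announcement does not show the output format for these breakdowns; the sketch below is a purely hypothetical way a downstream consumer might structure second-accurate segments (every field name here is invented, not Alibaba's schema):

```python
from dataclasses import dataclass, field

@dataclass
class VideoSegment:
    # Illustrative structure only -- not a documented Qwen3.5-Omni schema.
    start_s: float               # segment start, second-accurate
    end_s: float                 # segment end
    description: str             # scene breakdown text
    speakers: list[str] = field(default_factory=list)  # identified speakers
    events: list[str] = field(default_factory=list)    # cuts, sound effects
    flagged: bool = False        # sensitive-content flag for moderation

segments = [
    VideoSegment(0.0, 12.0, "Host introduces the demo", ["host"], ["jingle"]),
    VideoSegment(12.0, 47.0, "Screen recording of gameplay", [], ["cut"]),
]
for seg in segments:
    print(f"[{seg.start_s:5.1f}-{seg.end_s:5.1f}s] {seg.description}")
```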

Architecture Improvements

The model retains a thinker-talker design where the thinker processes multimodal input and generates text, while the talker converts output to contextual speech. Both components now use a hybrid attention-MoE (mixture-of-experts) architecture replacing the pure MoE setup from Qwen3-Omni.
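
A schematic of that split, going only on the description above (every function body below is a stand-in stub, not Alibaba's implementation):

```python
# Thinker-talker data flow as described; all internals are stubs.

def thinker(inputs: dict) -> str:
    """Stands in for the hybrid attention-MoE backbone that maps multimodal
    input (text, images, audio, video) to a text response."""
    return f"text response conditioned on {sorted(inputs)}"  # stub

def talker(text_response: str, dialog_context: list[str]):
    """Stands in for the second hybrid attention-MoE model that converts the
    thinker's text into streamed speech tokens for a vocoder."""
    for word in text_response.split():
        yield f"<speech:{word}>"  # stub speech token

reply = thinker({"audio": b"...", "video": b"...", "text": "build snake"})
speech = list(talker(reply, dialog_context=[]))
```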

The primary technical upgrade is ARIA (Adaptive Rate Interleave Alignment), which dynamically aligns and interleaves text and voice tokens to address the persistent problem of dropped words, mispronunciations, and garbled numbers in real-time voice output. The predecessor used a rigid 1:1 mapping between text and audio tokens.
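
The announcement describes ARIA only at this level, so the sketch below just contrasts the two scheduling regimes to show the shape of the idea; the rate heuristic is invented for illustration, not Alibaba's policy:

```python
def rigid_interleave(text_tokens):
    """Qwen3-Omni-style fixed 1:1 mapping: one audio token per text token,
    which squeezes long words and digit strings into a single slot."""
    stream = []
    for tok in text_tokens:
        stream += [("text", tok), ("audio", f"a({tok})")]
    return stream

def adaptive_interleave(text_tokens, rate):
    """ARIA-style idea: a dynamic audio-token budget per text token."""
    stream = []
    for tok in text_tokens:
        stream.append(("text", tok))
        stream += [("audio", f"a({tok},{i})") for i in range(rate(tok))]
    return stream

def toy_rate(tok: str) -> int:
    # Spend more audio tokens on digits and long words -- the failure cases
    # the announcement cites (garbled numbers, dropped words).
    return 4 if tok.isdigit() else max(1, len(tok) // 3)

print(adaptive_interleave(["pay", "1250", "dollars"], toy_rate))
```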

Real-Time Conversation Features

Qwen3.5-Omni adds "semantic interruption" to distinguish user intent from background noise, automatic web search for current information, and function calling support. Users can adjust voice characteristics (volume, tempo, emotion) via voice commands and upload custom voice samples for voice cloning.
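
How semantic interruption is implemented is not disclosed; as a toy illustration of the described behavior (the classifier below is a stub standing in for whatever model makes the call):

```python
def classify_chunk(chunk: bytes) -> str:
    """Stub for a model labeling audio as 'directed_speech' or 'background'."""
    return "directed_speech" if chunk.startswith(b"SPEECH") else "background"

def should_interrupt(chunk: bytes, assistant_speaking: bool) -> bool:
    # Barge in on the assistant's speech only for speech aimed at it,
    # not for door slams, music, or side conversations.
    return assistant_speaking and classify_chunk(chunk) == "directed_speech"

print(should_interrupt(b"SPEECH stop there", assistant_speaking=True))  # True
print(should_interrupt(b"NOISE door slam", assistant_speaking=True))    # False
```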

API-Only Release, No Open Weights

Unlike previous Qwen releases, Alibaba has not published model weights or announced a license. Qwen3.5-Omni is accessible only through API via the Qwen Chat interface and Alibaba Cloud Model Studio. Pricing information has not been disclosed.
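
Earlier Qwen models are reachable through Model Studio's OpenAI-compatible endpoint; assuming the same pattern holds here, access might look like the sketch below. The base URL is the existing compatible-mode endpoint, but the model identifier is a guess, since none has been published:

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DASHSCOPE_API_KEY",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",  # existing endpoint
)

response = client.chat.completions.create(
    model="qwen3.5-omni",  # hypothetical identifier -- not yet announced
    messages=[{"role": "user", "content": "Summarize this clip for me."}],
)
print(response.choices[0].message.content)
```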

Context and Leadership Changes

The release comes amid significant leadership turbulence. Junyang Lin, Alibaba's chief AI developer and the driving force behind the Qwen series, unexpectedly announced his departure. The team leads for Qwen's coding models, post-training, and Qwen 3.5/VL followed. The exits reportedly stemmed from an internal restructuring that would have placed a researcher from Google's Gemini team in charge. CEO Eddie Wu responded by establishing a new "Foundation Model Task Force" and reaffirming foundation model development as a core strategic priority.

What This Means

Qwen3.5-Omni represents a significant advance in omnimodal AI performance, particularly on audio and speech tasks, where its reported benchmarks exceed Gemini 3.1 Pro's across the board. The emergent code-generation capability, unintentional but functional, suggests that scaling native multimodal training produces capabilities beyond what architects explicitly design for. The API-only distribution diverges from Alibaba's open-source positioning and limits external verification of the claimed performance. The leadership exodus raises questions about execution continuity despite management reassurances.
