Alibaba Releases Qwen 3.5 Omni, a Multimodal Model That Codes From Video and Voice
On March 30, 2026, the Alibaba Qwen team released Qwen 3.5 Omni, a native multimodal model that processes text, images, audio, and video within a single computational pipeline.
Architecture
Qwen 3.5 Omni uses a Thinker-Talker architecture with Hybrid-Attention Mixture of Experts (MoE) across all modalities. The audio encoder was pre-trained on more than 100 million hours of audio-visual data, giving the model a strong understanding of temporal and acoustic patterns.
Capabilities
The model supports a 256k-token context window, can process over 10 hours of audio, and handles up to 400 seconds of 720p video sampled at 1 frame per second. Speech recognition covers 113 languages and dialects, and speech generation supports 36 languages.
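To make the video limit concrete: 400 seconds at 1 frame per second means at most 400 frames reach the model, so longer clips must be downsampled before submission. The sketch below shows one way a client might do that sampling; the function name and the timestamp-based approach are illustrative assumptions, not part of any documented Qwen tooling.

```python
# Hypothetical client-side sketch: downsample a clip's frames to 1 fps
# and enforce the 400-second video cap stated in the article.
# The function and its input format are assumptions for illustration.

MAX_VIDEO_SECONDS = 400  # stated cap: 400 s of 720p video at 1 fps

def sample_one_fps(frame_timestamps):
    """Keep the first frame seen in each whole second, up to the cap.

    `frame_timestamps` is a sorted list of per-frame times in seconds.
    Returns the timestamps of the frames that would be submitted.
    """
    selected = []
    last_second = -1
    for t in frame_timestamps:
        second = int(t)
        if second >= MAX_VIDEO_SECONDS:
            break  # anything past the cap is dropped
        if second != last_second:
            selected.append(t)
            last_second = second
    return selected

# A 30 fps clip lasting 10 seconds yields 10 sampled frames.
timestamps = [i / 30 for i in range(300)]
print(len(sample_one_fps(timestamps)))  # → 10
```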
Audio-Visual Vibe Coding
The Qwen team reported a new emergent capability they call Audio-Visual Vibe Coding: the model can write functional code based solely on watching a video demonstration and listening to spoken instructions. According to the team, the model was not explicitly trained for this behavior.
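A workflow like this would presumably pair video and audio inputs with a text prompt in a single request. The payload below is entirely hypothetical; the article does not document the actual API, so the model identifier, field names, and structure are invented purely to illustrate the shape of such a request.

```python
# Entirely hypothetical sketch of an "audio-visual vibe coding" request.
# The payload shape and field names are invented for illustration;
# the real Qwen 3.5 Omni API is not documented in the article.
import json

request = {
    "model": "qwen3.5-omni",  # model name as reported in the article
    "input": [
        {"type": "video", "path": "demo_of_desired_app.mp4"},   # screen recording
        {"type": "audio", "path": "spoken_instructions.wav"},   # voice narration
        {"type": "text",
         "text": "Write the code for the app shown and described above."},
    ],
}
print(json.dumps(request, indent=2))
```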
Closed Source
In a break from Alibaba's open-source tradition, Qwen 3.5 Omni launched as a closed-source API-only product. This surprised the community, as previous Qwen models were released with open weights.
Why It Matters
Multimodal models that truly unify text, audio, and video processing are still rare. Qwen 3.5 Omni's ability to work across modalities in real time, and the emergent coding capability, show how quickly the field is advancing beyond text-only interactions.