Alibaba Releases Qwen 3.5 Omni, a Multimodal Model That Codes From Video and Voice
On March 30, 2026, the Alibaba Qwen team released Qwen 3.5 Omni, a native multimodal model that processes text, images, audio, and video within a single computational pipeline.
Architecture
Qwen 3.5 Omni uses a Thinker-Talker architecture with Hybrid-Attention Mixture of Experts (MoE) across all modalities. The audio encoder was pre-trained on more than 100 million hours of audio-visual data, giving the model a strong understanding of temporal and acoustic patterns.
Capabilities
The model supports a 256k-token context window, can process over 10 hours of audio, and handles up to 400 seconds of 720p video sampled at 1 frame per second. Speech recognition covers 113 languages and dialects, and speech generation supports 36 languages.
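To make the video limit concrete: 400 seconds at 1 frame per second means at most 400 frames reach the model, so longer clips must be downsampled before submission. The sketch below shows one way a client might do that sampling; the function name and the timestamp-based approach are illustrative assumptions, not part of any documented Qwen tooling.

```python
# Hypothetical client-side sketch: downsample a clip's frames to 1 fps
# and enforce the 400-second video cap stated in the article.
# The function and its input format are assumptions for illustration.

MAX_VIDEO_SECONDS = 400  # stated cap: 400 s of 720p video at 1 fps

def sample_one_fps(frame_timestamps):
    """Keep the first frame seen in each whole second, up to the cap.

    `frame_timestamps` is a sorted list of per-frame times in seconds.
    Returns the timestamps of the frames that would be submitted.
    """
    selected = []
    last_second = -1
    for t in frame_timestamps:
        second = int(t)
        if second >= MAX_VIDEO_SECONDS:
            break  # anything past the cap is dropped
        if second != last_second:
            selected.append(t)
            last_second = second
    return selected

# A 30 fps clip lasting 10 seconds yields 10 sampled frames.
timestamps = [i / 30 for i in range(300)]
print(len(sample_one_fps(timestamps)))  # → 10
```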
Audio-Visual Vibe Coding
The Qwen team reported a new emergent capability they call Audio-Visual Vibe Coding: the model can write functional code based solely on watching a video demonstration and listening to spoken instructions. According to the team, the model was not explicitly trained for this behavior.
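A workflow like this would presumably pair video and audio inputs with a text prompt in a single request. The payload below is entirely hypothetical; the article does not document the actual API, so the model identifier, field names, and structure are invented purely to illustrate the shape of such a request.

```python
# Entirely hypothetical sketch of an "audio-visual vibe coding" request.
# The payload shape and field names are invented for illustration;
# the real Qwen 3.5 Omni API is not documented in the article.
import json

request = {
    "model": "qwen3.5-omni",  # model name as reported in the article
    "input": [
        {"type": "video", "path": "demo_of_desired_app.mp4"},   # screen recording
        {"type": "audio", "path": "spoken_instructions.wav"},   # voice narration
        {"type": "text",
         "text": "Write the code for the app shown and described above."},
    ],
}
print(json.dumps(request, indent=2))
```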
Closed Source
In a break from Alibaba's open-source tradition, Qwen 3.5 Omni launched as a closed-source API-only product. This surprised the community, as previous Qwen models were released with open weights.
Why It Matters
Multimodal models that truly unify text, audio, and video processing are still rare. Qwen 3.5 Omni's ability to work across modalities in real time, and the emergent coding capability, show how quickly the field is advancing beyond text-only interactions.