New Qwen3-VL model aims to understand and act in the digital world

The Qwen team has released a new series of open-source vision-language models called Qwen3-VL. According to the team’s official announcement, the models are designed not just to see images and videos but to understand context, reason about events, and perform actions. The flagship model, Qwen3-VL-235B-A22B, is available in two versions.

The developers claim the “Instruct” version matches or exceeds leading closed-source models such as Gemini 2.5 Pro on major visual perception benchmarks. A second “Thinking” version is optimized for complex reasoning in science, technology, and math.

Key capabilities highlighted by the team include operation as a visual agent: the model can control computer and mobile phone interfaces to complete tasks. It can also generate code from visual mockups, turning a design sketch into a functional webpage. The team further reports an improved understanding of spatial relationships, the ability to process very long videos and documents, with up to two hours of video content analyzed at once, text recognition in 32 languages, and broader recognition of specific entities, from celebrities to landmarks. The weights of Qwen3-VL-235B-A22B are released as open source for developers.
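For developers who want to try the open-weights release, the sketch below shows one plausible way to query it through Hugging Face Transformers, asking the model to turn a design mockup into a webpage. This is a minimal sketch, not an official recipe: the repository ID, the AutoModelForVision2Seq mapping, the chat-message format, and the image URL are assumptions rather than details confirmed in the announcement, so the official model card remains the authoritative reference.

```python
# Minimal sketch of querying the open-weights model via Hugging Face Transformers.
# The repo ID, the AutoModelForVision2Seq mapping, and the message format are
# assumptions based on how earlier Qwen VL releases were published; check the
# official model card for the exact class names and prompt conventions.
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "Qwen/Qwen3-VL-235B-A22B-Instruct"  # assumed repository name

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# One image plus a question, formatted with the processor's chat template.
image_url = "https://example.com/mockup.png"  # placeholder image
image = Image.open(requests.get(image_url, stream=True).raw)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Turn this design sketch into an HTML page."},
        ],
    }
]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

# Generate and decode the model's answer.
output_ids = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```

In practice, a 235B-parameter mixture-of-experts checkpoint will not fit on a single consumer GPU, so most developers would run this pattern against a hosted inference endpoint or a smaller checkpoint from the same family rather than loading the flagship weights locally.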
