Alibaba’s Qwen team has released QVQ-72B-Preview, a new experimental visual AI model designed to enhance visual reasoning capabilities. Built upon their Qwen2-VL-72B architecture, the model aims to combine language and vision processing to tackle complex analytical tasks. According to company statements, QVQ achieved a score of 70.3 on the MMMU benchmark, marking an improvement over its predecessor.
The model has been tested on multiple specialized datasets, including the mathematics-focused MathVista and MathVision, as well as the olympiad-level OlympiadBench. Early testing by independent researchers shows mixed but promising results, with the model demonstrating particular strength in systematic problem-solving. When presented with visual puzzles or counting tasks, QVQ attempts to break problems down into steps and explain its reasoning process.
However, the model comes with several documented limitations. These include a tendency to mix languages unexpectedly, potential circular reasoning patterns, and a gradual loss of focus on image content during multi-step reasoning tasks. The model was initially released under an Apache 2.0 license, which was subsequently changed to Alibaba’s proprietary Qwen license. The model can be accessed through Hugging Face Spaces for testing, and compatible versions are available for various frameworks, including MLX.
Sources: Qwen LM, Simon Willison