Chinese researchers have developed LLaVA-o1, an open-source vision-language model that introduces a four-stage reasoning process for analyzing images and text. As reported by Ben Dickson for VentureBeat, the model breaks complex tasks down into summary, caption, reasoning, and conclusion phases. The system, built on Llama-3.2-11B-Vision-Instruct and trained on 100,000 image-question-answer pairs, employs a novel "stage-level beam search" technique for improved accuracy. Testing shows LLaVA-o1 outperforms several competing models, including some closed-source alternatives like GPT-4o-mini and Gemini 1.5 Pro, with a 6.9% increase in benchmark scores over the base model.
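The idea behind stage-level beam search is that whole stage outputs compete with each other, rather than individual tokens: the model samples several candidates for each of the four stages, keeps the best one, and appends it to the context before generating the next stage. A minimal sketch of that loop is below; `generate_candidates` and `score` are hypothetical placeholders for model sampling and candidate ranking, not the actual LLaVA-o1 implementation.

```python
import random

# The four reasoning stages described for LLaVA-o1.
STAGES = ["summary", "caption", "reasoning", "conclusion"]

def generate_candidates(context, stage, n):
    # Placeholder: a real system would sample n outputs for this
    # stage from the vision-language model, conditioned on context.
    return [f"[{stage} candidate {i}]" for i in range(n)]

def score(candidate):
    # Placeholder: a real system might rank candidates with the
    # model's own likelihood or a learned verifier.
    return random.random()

def stage_level_beam_search(prompt, beam_width=4):
    """Select the best candidate at each stage, then move on.

    Unlike token-level beam search, entire stage outputs compete;
    each stage's winner becomes part of the context for the next.
    """
    context = prompt
    for stage in STAGES:
        candidates = generate_candidates(context, stage, beam_width)
        best = max(candidates, key=score)
        context = context + "\n" + best
    return context
```

With real sampling and scoring plugged in, widening `beam_width` trades compute for a better chance that each stage's output is sound before the next stage builds on it.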