Tencent releases AI model that creates 3D-like videos from single photos

Chinese tech giant Tencent has launched HunyuanWorld-Voyager, an AI model that transforms static images into navigable, 3D-like video sequences. Benj Edwards reported on the announcement for Ars Technica.

The system generates 49-frame video clips, each lasting roughly two seconds, from a single photograph. Users can specify camera movements, such as moving forward or backward and turning, to explore the virtual scene. Multiple clips can be chained together into sequences lasting several minutes.
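Those user-specified movements amount to a sequence of rigid camera poses composed step by step. A minimal sketch of the idea, using plain 4x4 homogeneous transforms (illustrative only; the function names and conventions here are assumptions, not Voyager's actual interface):

```python
import numpy as np

def forward(d):
    """Camera-to-world step: translate d units along the camera's
    forward axis (here taken as -z, the usual OpenGL convention)."""
    T = np.eye(4)
    T[2, 3] = -d
    return T

def turn(deg):
    """Camera-to-world step: yaw rotation about the camera's y axis."""
    r = np.radians(deg)
    R = np.eye(4)
    R[0, 0] = R[2, 2] = np.cos(r)
    R[0, 2] = np.sin(r)
    R[2, 0] = -np.sin(r)
    return R

# Compose a short trajectory: forward 1 unit, turn 30 degrees, forward 1 unit.
pose = np.eye(4)
for step in [forward(1.0), turn(30.0), forward(1.0)]:
    pose = pose @ step  # right-multiply to apply each step in camera frame
```

Each generated frame would then be rendered as seen from the current `pose`, which is what makes the clip feel like navigation rather than a fixed-viewpoint animation.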

Rather than building a true 3D model, Voyager produces 2D video frames that stay spatially consistent, as if a camera were moving through a real 3D scene. The AI generates color video and depth information simultaneously, keeping objects in their correct relative positions as the camera moves.

Tencent trained the model using over 100,000 video clips, including computer-generated scenes from Unreal Engine. The system uses a “world cache” that collects 3D points from previous frames to maintain consistency in new footage.
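A cache of this kind can be pictured as back-projecting each frame's depth pixels into a shared world-space point set, which later frames can be checked against. Below is a toy numpy sketch of that unproject-and-accumulate idea (assumed names and a standard pinhole-camera model; this is not Tencent's implementation):

```python
import numpy as np

def unproject(depth, K, cam_to_world):
    """Lift an HxW depth map to world-space 3D points using the 3x3
    pinhole intrinsics K and a 4x4 camera-to-world pose."""
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Homogeneous pixel coordinates (u, v, 1), one row per pixel.
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T           # camera-space ray directions
    pts_cam = rays * depth.reshape(-1, 1)     # scale each ray by its depth
    pts_h = np.concatenate([pts_cam, np.ones((pts_cam.shape[0], 1))], axis=1)
    return (pts_h @ cam_to_world.T)[:, :3]    # transform into world space

class WorldCache:
    """Toy point cache: accumulate 3D points from each generated frame
    so later frames can be kept consistent with earlier geometry."""
    def __init__(self):
        self.points = np.empty((0, 3))

    def add_frame(self, depth, K, pose):
        self.points = np.vstack([self.points, unproject(depth, K, pose)])
```

In this simplified picture, generating a new clip would condition on `cache.points` so that geometry seen in earlier frames reappears in the same world positions.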

The model requires significant computing power, needing at least 60GB of GPU memory for 540p resolution. Tencent recommends 80GB for optimal results. The company has made the model weights available on Hugging Face.

Voyager's license prohibits use in the European Union, the United Kingdom, and South Korea, and commercial deployments serving more than 100 million monthly users require a separate licensing agreement.

On Stanford University’s WorldScore benchmark, Voyager achieved an overall score of 77.62, ahead of competing systems such as WonderWorld and CogVideoX-I2V.
