Researchers have introduced OmniGen, the first diffusion model capable of unifying various image generation tasks within a single framework. Unlike existing models such as Stable Diffusion, OmniGen does not require additional modules to handle different control conditions, according to the authors Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, et al. The model can perform text-to-image generation, image editing, subject-driven generation, visual-conditional generation, and even some computer vision tasks such as edge detection and human pose recognition.
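The unifying idea is that every task, from plain text-to-image to editing and subject-driven generation, is expressed as a single interleaved prompt in which any reference images are referred to by placeholders, rather than being routed through task-specific adapters. The sketch below illustrates what such an interface looks like; the class name `OmniGenPipeline`, the checkpoint id, and the argument names are assumptions based on the authors' open-source release and may differ from the actual API.

```python
# Illustrative sketch only: OmniGenPipeline, the checkpoint id, and the
# argument names are assumptions, not verbatim from the paper.
from OmniGen import OmniGenPipeline

pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1")

# Text-to-image: the prompt alone drives generation.
t2i = pipe(
    prompt="A photo of a red fox in a snowy forest",
    height=1024, width=1024, guidance_scale=2.5,
)

# Image editing: the same call, with the source image referenced inline
# via a placeholder token inside the prompt.
edit = pipe(
    prompt="<img><|image_1|></img> Replace the sky with a sunset.",
    input_images=["photo.png"],
    height=1024, width=1024, guidance_scale=2.5, img_guidance_scale=1.6,
)

# Subject-driven generation: reference images of a subject are supplied
# the same way; no separate adapter or control module is loaded.
subject = pipe(
    prompt="The woman in <img><|image_1|></img> is reading a book in a cafe.",
    input_images=["woman.png"],
    height=1024, width=1024, guidance_scale=2.5, img_guidance_scale=1.6,
)
```

The point of the example is not the exact signatures but the design choice: one call path for every task, with control signals passed as ordinary images inside the prompt.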
OmniGen’s architecture is significantly simplified, eliminating the need for extra text encoders and preprocessing steps, which makes it more user-friendly. The researchers also highlight the model’s ability to transfer knowledge effectively across tasks and to handle unseen tasks and domains. To train OmniGen, they constructed a large-scale, diverse dataset called X2I (“anything to image”), comprising approximately 100 million images in a unified format. The authors plan to open-source the related resources to encourage further advancements in this field.
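Because X2I is described as packing heterogeneous tasks into one format, a plausible way to picture a training record is a single structure pairing an instruction that interleaves text and image placeholders with a target image. The field names below are hypothetical and chosen for clarity; they are not the authors' actual X2I schema.

```python
# Hypothetical illustration of a unified "anything-to-image" record;
# field names are invented for clarity, not taken from the X2I release.
x2i_examples = [
    {   # text-to-image
        "instruction": "A watercolor painting of a lighthouse at dawn",
        "input_images": [],
        "target_image": "lighthouse.png",
    },
    {   # image editing
        "instruction": "<img><|image_1|></img> Make the car blue.",
        "input_images": ["car_red.png"],
        "target_image": "car_blue.png",
    },
    {   # pose-conditioned generation (a vision signal supplied as an input image)
        "instruction": "Generate a person matching the pose in <img><|image_1|></img>.",
        "input_images": ["pose_map.png"],
        "target_image": "person.png",
    },
]

# Every task reduces to (instruction, input images) -> target image,
# so one model and one training loop can cover all of them.
```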