A new text-to-image framework called Sana can quickly and efficiently generate high-resolution images at up to 4096 x 4096 pixels. The system combines a deep compression autoencoder, linear attention, and a decoder-only language model as its text encoder to optimize performance. According to the developers, Sana-0.6B competes with state-of-the-art large diffusion models while being 20 times smaller and over 100 times faster. Specifically, Sana-0.6B runs on a laptop GPU with 16 GB of memory and generates a 1024 x 1024 pixel image in less than a second. The framework is designed to enable low-cost content creation.
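The linear attention mentioned above is a general technique, not something specific to Sana: replacing the softmax in attention with a simple feature map (a ReLU kernel is a common choice) lets the key-value product be computed once and reused for every query, reducing the cost from quadratic to linear in sequence length. The following NumPy sketch illustrates the idea under that assumption; it is not the actual Sana implementation.

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """O(N) attention sketch: a ReLU feature map replaces softmax,
    so K^T V is computed once (d x d) instead of an N x N score matrix."""
    Qf, Kf = np.maximum(Q, 0), np.maximum(K, 0)     # ReLU kernel feature map (assumed)
    KV = Kf.T @ V                                   # (d, d) summary, independent of N
    Z = Qf @ Kf.sum(axis=0, keepdims=True).T + eps  # per-query normalizer, shape (N, 1)
    return (Qf @ KV) / Z                            # (N, d) output

rng = np.random.default_rng(0)
N, d = 1024, 32                                     # sequence length, head dimension
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
out = linear_attention(Q, K, V)
print(out.shape)  # (1024, 32)
```

Because the `(d, d)` summary `KV` does not grow with the number of tokens, this formulation scales far better to the long token sequences produced by high-resolution images than standard softmax attention.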
There has been a lively discussion about this on Hacker News. Many commenters are impressed by the reported speed and efficiency of Sana, which is claimed to be 25 times faster than Flux-dev at comparable quality. The high speed is seen as a major advantage, since it makes it easy to generate many images and select the best result. However, some commenters are skeptical, pointing out that the sample images may have been cherry-picked and that actual performance can only be judged once the code is released. The hardware requirements are also discussed, particularly in the context of a demo running on a laptop with an RTX 4090.
An important point of discussion is the difficulty of objectively assessing the quality of AI-generated images. Commenters criticize that often only the best results are presented, which makes realistic comparisons between models difficult. One commenter emphasizes the importance of subtle details that AI models often miss, and questions whether AI researchers are equipped to judge artistic quality. It is suggested that benchmarks be developed that account for selecting the best image from multiple generated candidates.