Sana is a small and extremely fast AI image generator

A new text-to-image framework called Sana can efficiently generate high-resolution images up to 4096 x 4096 pixels. The system combines a deep compression autoencoder, linear attention, and a decoder-based text encoder to optimize performance. According to the developers, Sana-0.6B can compete with state-of-the-art large diffusion models, but is 20 times smaller and over …

Read more
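The linear attention mentioned above avoids the quadratic cost of softmax attention by applying a positive feature map and reassociating the matrix products. A minimal NumPy sketch of that idea (the feature map and shapes here are common illustrative choices, not Sana's actual implementation):

```python
import numpy as np

def linear_attention(Q, K, V):
    # Positive feature map (a common choice; Sana's exact kernel may differ).
    phi = lambda x: np.maximum(x, 0.0) + 1e-6
    Qp, Kp = phi(Q), phi(K)
    # Associativity lets us compute (K^T V) first:
    # O(n * d^2) instead of the O(n^2 * d) of softmax attention.
    KV = Kp.T @ V                 # (d, d) summary of keys and values
    Z = Qp @ Kp.sum(axis=0)       # (n,) per-query normalizer
    return (Qp @ KV) / Z[:, None]

rng = np.random.default_rng(0)
n, d = 1024, 64
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = linear_attention(Q, K, V)
print(out.shape)  # (1024, 64)
```

Because the cost grows linearly with the number of tokens, this kind of attention stays tractable at the very large token counts a 4096 x 4096 image implies.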

ARIA is open and natively multimodal

ARIA is an open, multimodal-native mixture-of-experts model designed to integrate diverse forms of information for comprehensive understanding, outperforming existing proprietary models on a range of tasks. Of its 24.9 billion total parameters, it activates 3.9 billion per visual token and 3.5 billion per text token. The model is pre-trained on a substantial dataset comprising 6.4 trillion …

Read more
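The gap between total and activated parameters comes from mixture-of-experts routing: a gate scores all experts per token but runs only the top few. A toy sketch of that mechanism (the tiny linear "experts" and router here are illustrative stand-ins, not ARIA's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 16, 8, 2
# Hypothetical tiny experts; real MoE experts are transformer FFN blocks.
expert_weights = [rng.standard_normal((d, d)) * 0.1 for _ in range(n_experts)]
gate_w = rng.standard_normal((d, n_experts))

def moe_forward(x):
    # The router activates only the top-k experts per token, which is why
    # activated parameters are far fewer than total parameters.
    scores = x @ gate_w                      # (n_tokens, n_experts)
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        top = np.argsort(scores[i])[-k:]     # top-k experts for this token
        w = np.exp(scores[i, top] - scores[i, top].max())
        w /= w.sum()                         # softmax over selected experts
        for e, wt in zip(top, w):
            out[i] += wt * (x[i] @ expert_weights[e])
    return out

tokens = rng.standard_normal((4, d))
y = moe_forward(tokens)
print(y.shape)  # (4, 16)
```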

DeepMind’s Michelangelo tests reasoning in long context windows

DeepMind has introduced the Michelangelo benchmark to evaluate the long-context reasoning capabilities of large language models (LLMs), Ben Dickson reports for VentureBeat. While LLMs can manage extensive context windows, research indicates they struggle with reasoning over complex data structures. Current benchmarks often focus on retrieval tasks, which do not adequately assess a model’s reasoning abilities. …

Read more

Molmo to improve AI agents

A new open-source AI model called Molmo could help advance the development of AI agents. Developed by the Allen Institute for AI (Ai2), the model can interpret images and communicate via a chat interface. According to Wired’s Will Knight, this enables AI agents to perform tasks such as web browsing or document creation. In some …

Read more

WonderWorld creates interactive 3D scenes

WonderWorld can be used to create interactive 3D scenes from a single image. It is the result of research at Stanford University and MIT. WonderWorld allows users to define scene content and layouts in real time and explore the resulting 3D worlds with low latency. At its core is a new rendering method called Fast …

Read more

EzAudio creates high quality sound effects

Researchers at Johns Hopkins University and Tencent AI Lab have developed a new text-to-audio model called EzAudio. As Michael Nuñez reports for VentureBeat, EzAudio can generate high-quality sound effects from text descriptions. The model uses an innovative method for processing audio data and a new architecture called EzAudio-DiT. In tests, EzAudio outperformed existing open-source models …

Read more

Google’s DataGemma specializes in statistics

Google is introducing two new AI models called DataGemma, which are designed to answer statistical questions more accurately. The models, based on the Gemma family, use data from Google’s Data Commons platform. As Shubham Sharma reports in an article for VentureBeat, the models use two different approaches: Retrieval Interleaved Generation (RIG) and Retrieval Augmented Generation …

Read more
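The two approaches differ in when retrieval happens: RAG fetches the statistic before generation, while RIG lets the model emit inline queries that are resolved afterwards. A minimal sketch of both patterns (the dictionary stands in for Google's Data Commons, and the `[DC: …]` query syntax is an illustrative assumption, not DataGemma's actual format):

```python
# Mock data store standing in for Data Commons.
DATA_COMMONS = {"population of France 2023": "68 million"}

def retrieve(query):
    return DATA_COMMONS.get(query, "unknown")

def rag_answer(question, generate):
    # RAG: retrieve first, then condition generation on the retrieved fact.
    fact = retrieve(question)
    return generate(f"Context: {fact}\nQuestion: {question}")

def rig_answer(draft):
    # RIG: the model interleaves queries into its draft; each [DC: ...]
    # marker is replaced with the retrieved value after generation.
    out = draft
    while "[DC:" in out:
        start = out.index("[DC:")
        end = out.index("]", start)
        query = out[start + 4 : end].strip()
        out = out[:start] + retrieve(query) + out[end + 1 :]
    return out

print(rig_answer("France has [DC: population of France 2023] inhabitants."))
# → "France has 68 million inhabitants."
```

Grounding the numbers in a structured data source is what lets either pattern reduce statistical hallucinations.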

Transfusion enables combined text and image models

A new method called Transfusion enables the training of models that can process and generate both text and images. As researchers from Meta and other institutions report, Transfusion combines next-token prediction for text with diffusion for images in a single transformer model. Experiments have shown that this approach scales better than quantizing …

Read more
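Training one transformer on both modalities comes down to summing two objectives: cross-entropy on text positions and a denoising loss on image positions. A simplified sketch of that combined loss (function names, shapes, and the balancing weight are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def cross_entropy(logits, targets):
    # Standard next-token loss for the text positions.
    logits = logits - logits.max(axis=-1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

def diffusion_mse(pred_noise, true_noise):
    # Simplified denoising (noise-prediction) loss for image patches.
    return ((pred_noise - true_noise) ** 2).mean()

def transfusion_loss(text_logits, text_targets, pred_noise, true_noise, lam=1.0):
    # One transformer, one combined objective; lam balances the two terms
    # (an illustrative hyperparameter, not the paper's value).
    return cross_entropy(text_logits, text_targets) + lam * diffusion_mse(pred_noise, true_noise)

rng = np.random.default_rng(0)
text_logits = rng.standard_normal((8, 100))   # 8 text tokens, vocab size 100
text_targets = rng.integers(0, 100, size=8)
pred = rng.standard_normal((4, 32))           # 4 image patches
true = rng.standard_normal((4, 32))
loss = transfusion_loss(text_logits, text_targets, pred, true)
print(float(loss))
```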

Benchmarks for AI agents flawed, study reveals

A new research report from Princeton University reveals weaknesses in current benchmarks and evaluation practices for AI agents. The researchers argue that cost control is often neglected in evaluation, even though the resource costs of AI agents can be significantly higher than those of individual model queries. This leads to biased results, as expensive agents …

Read more

DeepMind JEST speeds up AI training

Researchers at Google DeepMind have developed a new method called JEST that significantly speeds up AI training while reducing energy requirements. By optimizing the selection of training data, JEST can reduce the number of training iterations by a factor of 13 and the computational cost by a factor of 10.
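The core idea is to score candidate examples by learnability, comparing a learner model's loss against a reference model's, and train only on the most informative ones. A simplified per-example sketch of that selection step (JEST proper scores whole sub-batches jointly; the losses here are random placeholders):

```python
import numpy as np

def jest_select(learner_loss, reference_loss, frac=0.1):
    # Learnability score: examples the learner still finds hard but the
    # reference model finds easy are the most informative to train on.
    score = learner_loss - reference_loss
    n_keep = max(1, int(len(score) * frac))
    return np.argsort(score)[-n_keep:]   # indices of the top-scoring examples

rng = np.random.default_rng(0)
learner = rng.uniform(0.0, 5.0, size=100)     # placeholder per-example losses
reference = rng.uniform(0.0, 5.0, size=100)
keep = jest_select(learner, reference, frac=0.1)
print(len(keep))  # 10
```

Discarding the uninformative bulk of each batch is what yields the reported savings in iterations and compute.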