Spirit LM is Meta’s first freely available multimodal model

Meta has launched Spirit LM, its first freely available multimodal language model that integrates text and speech inputs and outputs, positioning it as a competitor to models like OpenAI’s GPT-4o. Developed by Meta’s Fundamental AI Research (FAIR) team, Spirit LM aims to enhance AI voice experiences by improving speech generation’s …

Read more

ARIA is open and natively multimodal

ARIA is an open, natively multimodal mixture-of-experts model designed to integrate diverse forms of information for comprehensive understanding, outperforming existing proprietary models in various tasks. Of its 24.9 billion total parameters, it activates 3.9 billion per visual token and 3.5 billion per text token. The model is pre-trained on a substantial dataset comprising 6.4 trillion …

Read more

Nvidia surprises with powerful, open AI models

Nvidia has released a powerful open-source AI model that rivals proprietary systems from industry leaders like OpenAI and Google. The model, called NVLM 1.0, demonstrates exceptional performance in vision and language tasks while also enhancing text-only capabilities. Michael Nuñez reports on this development for VentureBeat. The main model, NVLM-D-72B, with 72 billion parameters, can process …

Read more

Meta Llama 3.2 is here

Meta today released the new version of its AI model series, Llama 3.2, which for the first time includes vision models that can process both images and text. The larger versions, with 11 and 90 billion parameters, should be able to compete with closed systems like Claude 3 Haiku for image processing. Also new …

Read more

Pixtral 12B: Mistral’s first multimodal model

French AI startup Mistral has released its first multimodal model, Pixtral 12B. As the name suggests, it has 12 billion parameters and can process both images and text. It is based on Mistral’s existing text model Nemo 12B and is said to be able to answer questions about any number of images of any size. Pixtral …

Read more

Multimodal Arena sees GPT-4o in the lead

The new “Multimodal Arena” from LMSYS compares the performance of different AI models on image-related tasks and shows that OpenAI’s GPT-4o leads the pack, closely followed by Claude 3.5 Sonnet and Gemini 1.5 Pro. Surprisingly, open-source models such as LLaVA-v1.6-34B achieve results comparable to some proprietary models. The catch? Despite progress, Princeton’s CharXiv benchmark …

Read more

Apple 4M is a multimodal powerhouse

The “4M” AI model provides a glimpse into Apple’s progress in artificial intelligence. Developed in collaboration with EPF Lausanne, the model can convert text to images, recognize objects, and manipulate 3D scenes based on speech input.

Meta Chameleon is a new multimodal AI

Facebook’s parent company Meta has unveiled Chameleon, a new multimodal AI model that can process images, text, and code simultaneously. Unlike other models that use separate components for different types of data, Chameleon was designed from the ground up to handle multiple modalities.

Nvidia ChatRTX supports Google Gemma

Nvidia’s ChatRTX chatbot now supports Google’s Gemma model, allowing users to interact with their own documents, photos, and YouTube videos. The update also includes voice search and offers more ways to search locally stored data using different AI models.

OpenAI releases GPT-4o and more

One day before Google I/O, OpenAI tried to steal the show from its big competitor, and its demos certainly caused quite a stir. The focus was on its latest AI model, GPT-4o, where the “o” stands for “omni”, indicating that this version processes not only text but also, for example, images and …

Read more