Pixtral 12B: Mistral’s first multimodal model

French AI startup Mistral has released its first multimodal model, Pixtral 12B. As the name suggests, it has 12 billion parameters and can process both images and text. It is based on Mistral’s existing text model Nemo 12B and is said to be able to answer questions about any number of images of arbitrary size. Pixtral … Read more

Multimodal Arena sees GPT-4o in the lead

The new “Multimodal Arena” from LMSYS compares the performance of different AI models on image-related tasks and shows OpenAI’s GPT-4o leading the pack, closely followed by Claude 3.5 Sonnet and Gemini 1.5 Pro. Surprisingly, open-source models such as LLaVA-v1.6-34B achieve results comparable to some proprietary models. The catch? Despite this progress, Princeton’s CharXiv benchmark … Read more

Apple 4M is a multimodal powerhouse

The “4M” AI model provides a glimpse into Apple’s progress in artificial intelligence. Developed in collaboration with EPF Lausanne, the model can convert text to images, recognize objects, and manipulate 3D scenes based on natural-language input.

Meta Chameleon is a new multimodal AI

Facebook’s parent company Meta has unveiled Chameleon, a new multimodal AI model that can process images, text, and code simultaneously. Unlike other models that use separate components for different types of data, Chameleon was designed from the ground up to handle multiple modalities.

Nvidia ChatRTX supports Google Gemma

Nvidia’s ChatRTX chatbot now supports Google’s Gemma model, allowing users to interact with their own documents, photos, and YouTube videos. The update also includes voice search and offers more ways to search locally stored data using different AI models.

OpenAI releases GPT-4o and more

One day before Google’s I/O, OpenAI tried to steal the show from its big competitor, and its demos definitely caused a stir. The focus was on its latest AI model, GPT-4o, where the “o” stands for “omni”, indicating that this version processes not only text but also, for example, images and … Read more

Multimodal AI Reka Core announced

Reka, a San Francisco-based AI startup, has introduced Reka Core, a powerful multimodal language model developed in less than a year that is said to match or even surpass leading models from OpenAI, Google, and Anthropic. The model handles modalities such as image, audio, and video, supports 32 languages, and comes with a context window of … Read more