Multimodal (from Latin multi = many, multiple and modus = way, manner) refers, in the context of Artificial Intelligence, to an AI system’s ability to process and understand several different types of input, or “modalities”, simultaneously. While earlier AI systems typically focused on a single form of communication – such as text or images – a multimodal system can process several forms of expression in parallel, understanding and correlating text, images, video, speech, and even gestures.
A practical example is GPT-4V (GPT-4 with vision), which can “understand” and communicate about both text inputs and images.
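To make this concrete, the following sketch shows how a combined text-plus-image input can be expressed as a single user message in the content format used by OpenAI’s Chat Completions API. No API call is made here; the image URL is a placeholder, and the helper function name is our own illustration, not part of any library.

```python
def build_multimodal_message(question: str, image_url: str) -> dict:
    """Combine a text question and an image reference into one user message,
    following the multi-part content format of OpenAI's Chat Completions API."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},          # text modality
            {"type": "image_url", "image_url": {"url": image_url}},  # image modality
        ],
    }

# Placeholder image URL for illustration only.
message = build_multimodal_message(
    "What is shown in this picture?",
    "https://example.com/photo.jpg",
)
# One message, two modalities: the model receives text and image together.
print(len(message["content"]))
```

A multimodal model such as GPT-4V receives both parts in one turn and can relate the question directly to the image content, rather than handling each modality in a separate system.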
This capability makes multimodal AI systems particularly useful in everyday settings, as they more closely resemble human perception and communication – humans, too, perceive their environment through multiple sensory channels and combine that information into a complete picture. Multimodality is considered an important step towards more advanced and natural human–machine interaction.