Meta has launched Spirit LM, its first freely available multimodal language model that integrates text and speech for both input and output, positioning it as a competitor to models such as OpenAI’s GPT-4o. Developed by Meta’s Fundamental AI Research (FAIR) team, Spirit LM aims to make AI voice experiences more expressive and natural-sounding. The model is available only for non-commercial use under a specific license, which allows users to modify it and create derivative works but prohibits commercial distribution, VentureBeat reports.
Spirit LM comes in two versions: the Base model, which uses phonetic tokens, and the Expressive model, which adds pitch and tone tokens to convey emotional nuance. Both versions are trained on diverse text and speech datasets, enabling cross-modal tasks such as speech-to-text and text-to-speech while preserving natural expressiveness. Meta’s commitment to open science is reflected in the release of the model’s weights, code, and documentation, though, as noted above, only for researchers and not for commercial use.
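To make the token-level design more concrete, the Python sketch below illustrates what a mixed text-and-speech token stream might look like, with pitch and style tokens woven in for the Expressive variant. The token names, values, and structure here are illustrative assumptions, not Meta’s actual implementation.

```python
# Illustrative sketch only: token kinds and values are assumptions, not
# Spirit LM's real vocabulary. It shows the general idea of interleaving
# text tokens with phonetic speech tokens, plus pitch/style tokens in the
# Expressive variant.
from dataclasses import dataclass

@dataclass
class Token:
    kind: str   # "text", "phonetic", "pitch", or "style" (hypothetical)
    value: str

# Base model: text interleaved with phonetic speech tokens.
base_stream = [
    Token("text", "Hello"),
    Token("phonetic", "HH"), Token("phonetic", "AH"),
    Token("phonetic", "L"), Token("phonetic", "OW"),
]

# Expressive model: pitch and style tokens accompany the speech span so the
# model can carry emotional nuance across modalities.
expressive_stream = [
    Token("text", "Hello"),
    Token("pitch", "high"), Token("style", "excited"),
    Token("phonetic", "HH"), Token("phonetic", "AH"),
    Token("phonetic", "L"), Token("phonetic", "OW"),
]

def describe(stream: list[Token]) -> None:
    """Print a compact view of a mixed-modality token stream."""
    print(" ".join(f"[{t.kind}:{t.value}]" for t in stream))

describe(base_stream)
describe(expressive_stream)
```

The key point of the design is that speech and text share one token sequence, so a single language model can move between modalities mid-sequence rather than chaining separate ASR and TTS systems.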
The model’s capabilities include automatic speech recognition, text-to-speech, and speech classification, with the Expressive model reportedly particularly adept at detecting and reflecting emotional states. This advance has potentially significant implications for applications such as virtual assistants and customer service bots, enabling more natural and engaging interactions.
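As a rough illustration of how such capabilities might be exposed to developers, the sketch below defines a hypothetical SpiritLM wrapper with transcribe and speak methods. None of these names or signatures come from Meta’s released code; they are placeholders for the kinds of cross-modal calls the article describes.

```python
# Hypothetical interface sketch: the SpiritLM class and its methods are
# assumptions for illustration, not the API shipped with Meta's release.

class SpiritLM:
    """Stand-in for a multimodal model that accepts text or speech input."""

    def __init__(self, variant: str = "expressive"):
        self.variant = variant  # "base" or "expressive" (hypothetical)

    def transcribe(self, audio_path: str) -> str:
        """Automatic speech recognition: speech in, text out."""
        raise NotImplementedError("placeholder for the real model call")

    def speak(self, text: str, emotion: str | None = None) -> bytes:
        """Text-to-speech; an expressive variant could condition on emotion."""
        raise NotImplementedError("placeholder for the real model call")

model = SpiritLM(variant="expressive")
# Example calls a customer-service bot might make (paths and labels invented):
# text = model.transcribe("customer_call.wav")
# audio = model.speak("How can I help you today?", emotion="friendly")
```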